MATRIX COMPUTATIONS

THIRD EDITION

Johns Hopkins Studies in the Mathematical Sciences
in association with the Department of Mathematical Sciences,
The Johns Hopkins University


Gene H. Golub 


Department of Computer Science 
Stanford University 


Charles F. Van Loan 


Department of Computer Science 
Cornell University 


The Johns Hopkins University Press 
Baltimore and London


©1983, 1989, 1996 The Johns Hopkins University Press 
All rights reserved. Published 1996 

Printed in the United States of America on acid-free paper 
05 04 03 02 01 00 99 98 97 5432 


First edition 1983 
Second edition 1989 
Third Edition 1996 


The Johns Hopkins University Press 
2715 North Charles Street 

Baltimore, Maryland 21218-4319 

The Johns Hopkins Press Ltd., London 


Library of Congress Cataloging-in-Publication Data will be found 


at the end of this book.


A catalog record for this book is available from the British Library. 


ISBN 0-8018-5413-X 
ISBN 0-8018-5414-8  (pbk.) 


DEDICATED TO 


ALSTON S. HOUSEHOLDER
AND 


JAMES H. WILKINSON 


Contents


Preface to the Third Edition
Software
Selected References


1 Matrix Multiplication Problems

1.1 Basic Algorithms and Notation
1.2 Exploiting Structure
1.3 Block Matrices and Algorithms
1.4 Vectorization and Re-Use Issues


2 Matrix Analysis

2.1 Basic Ideas from Linear Algebra
2.2 Vector Norms
2.3 Matrix Norms
2.4 Finite Precision Matrix Computations
2.5 Orthogonality and the SVD
2.6 Projections and the CS Decomposition
2.7 The Sensitivity of Square Linear Systems


3 General Linear Systems

3.1 Triangular Systems
3.2 The LU Factorization
3.3 Roundoff Analysis of Gaussian Elimination
3.4 Pivoting
3.5 Improving and Estimating Accuracy


4 Special Linear Systems

4.1 The $LDM^T$ and $LDL^T$ Factorizations
4.2 Positive Definite Systems
4.3 Banded Systems
4.4 Symmetric Indefinite Systems
4.5 Block Systems
4.6 Vandermonde Systems and the FFT
4.7 Toeplitz and Related Systems


5 Orthogonalization and Least Squares

5.1 Householder and Givens Matrices
5.2 The QR Factorization
5.3 The Full Rank LS Problem
5.4 Other Orthogonal Factorizations
5.5 The Rank Deficient LS Problem
5.6 Weighting and Iterative Improvement
5.7 Square and Underdetermined Systems


6 Parallel Matrix Computations

6.1 Basic Concepts
6.2 Matrix Multiplication
6.3 Factorizations


7 The Unsymmetric Eigenvalue Problem

7.1 Properties and Decompositions
7.2 Perturbation Theory
7.3 Power Iterations
7.4 The Hessenberg and Real Schur Forms
7.5 The Practical QR Algorithm
7.6 Invariant Subspace Computations
7.7 The QZ Method for $Ax = \lambda Bx$


8 The Symmetric Eigenvalue Problem

8.1 Properties and Decompositions
8.2 Power Iterations
8.3 The Symmetric QR Algorithm
8.4 Jacobi Methods
8.5 Tridiagonal Methods
8.6 Computing the SVD
8.7 Some Generalized Eigenvalue Problems


9 Lanczos Methods

9.1 Derivation and Convergence Properties
9.2 Practical Lanczos Procedures
9.3 Applications to $Ax = b$ and Least Squares
9.4 Arnoldi and Unsymmetric Lanczos


10 Iterative Methods for Linear Systems

10.1 The Standard Iterations
10.2 The Conjugate Gradient Method
10.3 Preconditioned Conjugate Gradients
10.4 Other Krylov Subspace Methods


11 Functions of Matrices

11.1 Eigenvalue Methods
11.2 Approximation Methods
11.3 The Matrix Exponential


12 Special Topics

12.1 Constrained Least Squares
12.2 Subset Selection Using the SVD
12.3 Total Least Squares
12.4 Computing Subspaces with the SVD
12.5 Updating Matrix Factorizations
12.6 Modified/Structured Eigenproblems


Bibliography

Index

Preface to the Third Edition 


The field of matrix computations continues to grow and mature. In 
the Third Edition we have added over 300 new references and 100 new 
problems. The LINPACK and EISPACK citations have been replaced with 
appropriate pointers to LAPACK with key codes tabulated at the beginning 
of appropriate chapters. 

In the First Edition and Second Edition we identified a small number 
of global references: Wilkinson (1965), Forsythe and Moler (1967), Stewart 
(1973), Hanson and Lawson (1974) and Parlett (1980). These volumes are
as important as ever to the research landscape, but there are some mag- 
nificent new textbooks and monographs on the scene. See The Literature 
section that follows. 

We continue as before with the practice of giving references at the end 
of each section and a master bibliography at the end of the book. 

The earlier editions suffered from a large number of typographical errors 
and we are obliged to the dozens of readers who have brought these to our 
attention. Many corrections and clarifications have been made. 

Here are some specific highlights of the new edition. Chapter 1 (Matrix 
Multiplication Problems) and Chapter 6 (Parallel Matrix Computations) 
have been completely rewritten with less formality. We think that this 
facilitates the building of intuition for high performance computing and 
draws a better line between algorithm and implementation on the printed 
page. 

In Chapter 2 (Matrix Analysis) we expanded the treatment of CS de- 
composition and included a proof. The overview of floating point arithmetic 
has been brought up to date. In Chapter 4 (Special Linear Systems) we 
embellished the Toeplitz section with connections to circulant matrices and 
the fast Fourier transform. A subsection on equilibrium systems has been 
included in our treatment of indefinite systems. 

A more accurate rendition of the modified Gram-Schmidt process is 
offered in Chapter 5 (Orthogonalization and Least Squares). Chapter 8 
(The Symmetric Eigenproblem) has been extensively rewritten and rear- 
ranged so as to minimize its dependence upon Chapter 7 (The Unsymmet- 
ric Eigenproblem). Indeed, the coupling between these two chapters is now 
so minimal that it is possible to read either one first. 

In Chapter 9 (Lanczos Methods) we have expanded the discussion of
the unsymmetric Lanczos process and the Arnoldi iteration. The
“unsymmetric component” of Chapter 10 (Iterative Methods for Linear Systems)
has likewise been broadened with a whole new section devoted to various
Krylov space methods designed to handle the sparse unsymmetric linear
system problem.

In §12.5 (Updating Orthogonal Decompositions) we included a new
subsection on ULV updating. Toeplitz matrix eigenproblems and orthogonal
matrix eigenproblems are discussed in §12.6. 

Both of us look forward to continuing the dialog with our readers. As 
we said in the Preface to the Second Edition, “It has been a pleasure to 
deal with such an interested and friendly readership.” 

Many individuals made valuable Third Edition suggestions, but Greg 
Ammar, Mike Heath, Nick Trefethen, and Steve Vavasis deserve special 
thanks. 

Finally, we would like to acknowledge the support of Cindy Robinson 
at Cornell. A dedicated assistant makes a big difference. 


Software 


LAPACK 


Many of the algorithms in this book are implemented in the software pack- 
age LAPACK: 


E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, 
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. 
Sorensen (1995). LAPACK Users' Guide, Release 2.0, 2nd ed., SIAM 
Publications, Philadelphia. 


Pointers to some of the more important routines in this package are given 
at the beginning of selected chapters: 


Chapter 1. Level-1, Level-2, Level-3 BLAS 

Chapter 3. General Linear Systems 

Chapter 4. Positive Definite and Band Systems 

Chapter 5. Orthogonalization and Least Squares Problems 
Chapter 7. The Unsymmetric Eigenvalue Problem 
Chapter 8. The Symmetric Eigenvalue Problem 


Our LAPACK references are spare in detail but rich enough to “get you
started.” Thus, when we say that _TRSV can be used to solve a triangular
system $Ax = b$, we leave it to you to discover through the LAPACK manual
that $A$ can be either upper or lower triangular and that the transposed
system $A^T x = b$ can be handled as well. Moreover, the underscore is a
placeholder whose mission is to designate type (single, double, complex,
etc.).

LAPACK stands on the shoulders of two other packages that are mile- 
stones in the history of software development. EISPACK was developed in 
the early 1970s and is dedicated to solving symmetric, unsymmetric, and 
generalized eigenproblems: 


B.T. Smith, J.M. Boyle, Y. Ikebe, V.C. Klema, and C.B. Moler (1970). 
Matrix Eigensystem Routines: EISPACK Guide, 2nd ed., Lecture Notes
in Computer Science, Volume 6, Springer-Verlag, New York. 


B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler (1972). Matrix
Eigensystem Routines: EISPACK Guide Extension, Lecture Notes in
Computer Science, Volume 51, Springer-Verlag, New York. 


LINPACK was developed in the late 1970s for linear equations and least
squares problems.


EISPACK and LINPACK have their roots in a sequence of papers that feature
Algol implementations of some of the key matrix factorizations. These 
papers are collected in 


J.H. Wilkinson and C. Reinsch, eds. (1971). Handbook for Automatic 
Computation, Vol. 2, Linear Algebra, Springer-Verlag, New York. 


NETLIB 


A wide range of software including LAPACK, EISPACK, and LINPACK is
available electronically via Netlib:


World Wide Web: http://www.netlib.org/index.html
Anonymous ftp: ftp://ftp.netlib.org 


Via email, send a one-line message: 


mail netlib@ornl.gov
send index


to get started.


MATLAB®


Complementing LAPACK and defining a very popular matrix computation
environment is MATLAB:


MATLAB User's Guide, The Math Works Inc., Natick, Massachusetts. 
M. Marcus (1993). Matrices and MATLAB: A Tutorial, Prentice Hall, Up- 
per Saddle River, NJ. 


R. Pratap (1995). Getting Started with MATLAB, Saunders College Pub- 
lishing, Fort Worth, TX. 


Many of the problems in Matrix Computations are best posed to students
as MATLAB problems. We make extensive use of MATLAB notation in the 
presentation of algorithms. 


Selected References 


Each section in the book concludes with an annotated list of references. 
A master bibliography is given at the end of the text. 

Useful books that collectively cover the field are cited below. Chapter
titles are included if appropriate but do not infer too much from the level
of detail because one author's chapter may be another's subsection. The
citations are classified as follows:


Pre-1970 Classics. Early volumes that set the stage. 
Introductory (General). Suitable for the undergraduate classroom. 
Advanced (General). Best for practitioners and graduate students. 
Analytical. For the supporting mathematics. 

Linear Equation Problems. $Ax = b$.

Linear Fitting Problems. $Ax \approx b$.

Eigenvalue Problems. $Ax = \lambda x$.

High Performance. Parallel/vector issues. 

Edited Volumes. Useful, thematic collections. 


Within each group the entries are specified in chronological order.


Pre-1970 Classics


V.N. Faddeeva (1958). Computational Methods of Linear Algebra, Dover, 
New York. 
Basic Material from Linear Algebra. Systems of Linear Equations. The Proper
Numbers and Proper Vectors of a Matrix.


E. Bodewig (1959). Matrix Calculus, North Holland, Amsterdam.


Matrix Calculus. Direct Methods for Linear Equations. Indirect Methods for Linear
Equations. Inversion of Matrices. Geodetic Matrices. Eigenproblems.


R.S. Varga (1962). Matrix Iterative Analysis, Prentice-Hall, Englewood
Cliffs, NJ.
Matrix Properties and Concepts. Nonnegative Matrices. Basic Iterative Methods
and Comparison Theorems. Successive Overrelaxation Iterative Methods. Semi-
Iterative Methods. Derivation and Solution of Elliptic Difference Equations. Alter-
nating Direction Implicit Iterative Methods. Matrix Methods for Parabolic Partial
Differential Equations. Estimation of Acceleration Parameters.


J.H. Wilkinson (1963). Rounding Errors in Algebraic Processes, Prentice-Hall,
Englewood Cliffs, NJ.


The Fundamental Arithmetic Operations. Computations Involving Polynomials. 
Matrix Computations. 


A.S. Householder (1964). Theory of Matrices in Numerical Analysis, Blaisdell,
New York. Reprinted in 1974 by Dover, New York.


Some Basic Identities and Inequalities. Norms, Bounds, and Convergence. Localiza-
tion Theorems and Other Inequalities. The Solution of Linear Systems: Methods of
Successive Approximation. Direct Methods of Inversion. Proper Values and Vectors:
Normalization and Reduction of the Matrix. Proper Values and Vectors: Successive 
Approximation. 


L. Fox (1964). An Introduction to Numerical Linear Algebra, Oxford Uni- 
versity Press, Oxford, England. 


Introduction. Matrix Algebra. Elimination Methods of Gauss, Jordan, and Aitken.
Compact Elimination Methods of Doolittle, Crout, Banachiewicz, and Cholesky.
Orthogonalization Methods. Condition, Accuracy, and Precision. Comparison of
Methods, Measure of Work. Iterative and Gradient Methods. Iterative Methods for
Latent Roots and Vectors. Transformation Methods for Latent Roots and Vectors. 
Notes on Error Analysis for Latent Roots and Vectors. 


J.H. Wilkinson (1965). The Algebraic Eigenvalue Problem, Clarendon Press,
Oxford, England.


Theoretical Background. Perturbation Theory. Error Analysis. Solution of Lin-
ear Algebraic Equations. Hermitian Matrices. Reduction of a General Matrix to
Condensed Form. Eigenvalues of Matrices of Condensed Forms. The LR and QR
Algorithms. Iterative Methods.


G.E. Forsythe and C. Moler (1967). Computer Solution of Linear Algebraic 
Systems, Prentice-Hall, Englewood Cliffs, NJ. 


Reader's Background and Purpose of Book. Vector and Matrix Norms. Diagonal 
Form of a Matrix Under Orthogonal Equivalence. Proof of Diagonal Form Theorem. 
Types of Computational Problems in Linear Algebra. Types of Matrices encoun- 
tered in Practical Problems. Sources of Computational Problems of Linear Algebra. 
Condition of a Linear System. Gaussian Elimination and LU Decomposition. Need 
for Interchanging Rows. Scaling Equations and Unknowns. The Crout and Doolit- 
tle Variants. Iterative Improvement. Computing the Determinant. Nearly Singular 
Matrices. Algol 60 Program. Fortran, Extended Algol, and PL/I Programs. Ma- 
trix Inversion. An Example: Hilbert Matrices. Floating Point Round-Off Analysis. 
Rounding Error in Gaussian Elimination. Convergence of Iterative Improvement.
Positive Definite Matrices; Band Matrices. Iterative Methods for Solving Linear
Systems. Nonlinear Systems of Equations.


Matrix Computations


Chapter 1 


Matrix Multiplication 
Problems 


§1.1 Basic Algorithms and Notation

§1.2 Exploiting Structure

§1.3 Block Matrices and Algorithms

§1.4 Vectorization and Re-Use Issues


The proper study of matrix computations begins with the study of the 
matrix-matrix multiplication problem. Although this problem is simple 
mathematically it is very rich from the computational point of view. We 
begin in §1.1 by looking at the several ways that the matrix multiplication
problem can be organized. The “language” of partitioned matrices
is established and used to characterize several linear algebraic “levels” of 
computation. 

If a matrix has structure, then it is usually possible to exploit it. For 
example, a symmetric matrix can be stored in half the space as a general 
matrix. A matrix-vector product that involves a matrix with many zero 
entries may require much less time to execute than a full matrix times a 
vector. These matters are discussed in §1.2.

In §1.3 block matrix notation is established. A block matrix is a matrix
with matrix entries. This concept is very important from the standpoint of 
both theory and practice. On the theoretical side, block matrix notation 
allows us to prove important matrix factorizations very succinctly. These 
factorizations are the cornerstone of numerical linear algebra. From the 
computational point of view, block algorithms are important because they
are rich in matrix multiplication, the operation of choice for many new high
performance computer architectures. 

These new architectures require the algorithm designer to pay as much 
attention to memory traffic as to the actual amount of arithmetic. This 
aspect of scientific computation is illustrated in §1.4 where the critical
issues of vector pipeline computing are discussed: stride, vector length, the
number of vector loads and stores, and the level of vector re-use. 


Before You Begin 


It is important to be familiar with the MATLAB language. See the
texts by Pratap (1995) and Van Loan (1996). A richer introduction to high
performance matrix computations is given in Dongarra, Duff, Sorensen, and
van der Vorst (1991). This chapter’s LAPACK connections include


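LAPACK: Some General Operations

$x \leftarrow \alpha x$                              | Vector scale
$\mu \leftarrow x^T y$                               | Dot product
$y \leftarrow \alpha x + y$                          | Saxpy
$y \leftarrow \alpha A x + \beta y$                  | Matrix-vector multiplication
$A \leftarrow A + \alpha x y^T$                      | Rank-1 update
$C \leftarrow \alpha A B + \beta C$                  | Matrix multiplication


LAPACK: Some Symmetric Operations

$y \leftarrow \alpha A x + \beta y$                  | Matrix-vector multiplication
$y \leftarrow \alpha A x + \beta y$                  | Matrix-vector multiplication (Packed)
$A \leftarrow \alpha x x^T + A$                      | Rank-1 update
$A \leftarrow \alpha x y^T + \alpha y x^T + A$       | Rank-2 update
$C \leftarrow \alpha A A^T + \beta C$                | Rank-k update
$C \leftarrow \alpha A B^T + \alpha B A^T + \beta C$ | Rank-2k update
$C \leftarrow \alpha A B + \beta C$                  | Symmetric/General Product


$B \leftarrow \alpha A B$ (or $BA$)                  | Triangular/General Product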


1.1 Basic Algorithms and Notation 


Matrix computations are built upon a hierarchy of linear algebraic opera- 
tions. Dot products involve the scalar operations of addition and multipli- 
cation. Matrix-vector multiplication is made up of dot products. Matrix-matrix
multiplication amounts to a collection of matrix-vector products.
All of these operations can be described in algorithmic form or in the
language of linear algebra. Our primary objective in this section is to show
how these two styles of expression complement each other. Along the way
we pick up notation and acquaint the reader with the kind of thinking that 
underpins the matrix computation area. The discussion revolves around 
the matrix multiplication problem, a computation that can be organized in 
several ways. 

1.1.1 Matrix Notation 


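Let $\mathbb{R}$ denote the set of real numbers. We denote the vector space of all
$m$-by-$n$ real matrices by $\mathbb{R}^{m \times n}$:

$$A \in \mathbb{R}^{m \times n} \quad\Longleftrightarrow\quad A = (a_{ij}) = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \qquad a_{ij} \in \mathbb{R}.$$

If a capital letter is used to denote a matrix (e.g., $A$, $B$, $\Delta$), then the
corresponding lower case letter with subscript $ij$ refers to the $(i,j)$ entry
(e.g., $a_{ij}$, $b_{ij}$, $\delta_{ij}$). As appropriate, we also use the notation $[A]_{ij}$ and
$A(i,j)$ to designate the matrix elements.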


1.1.2 Matrix Operations 
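Basic matrix operations include transposition ($\mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{n \times m}$),

$$C = A^T \quad\Longrightarrow\quad c_{ij} = a_{ji},$$

addition ($\mathbb{R}^{m \times n} \times \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$),

$$C = A + B \quad\Longrightarrow\quad c_{ij} = a_{ij} + b_{ij},$$

scalar-matrix multiplication ($\mathbb{R} \times \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$),

$$C = \alpha A \quad\Longrightarrow\quad c_{ij} = \alpha a_{ij},$$

and matrix-matrix multiplication ($\mathbb{R}^{m \times p} \times \mathbb{R}^{p \times n} \rightarrow \mathbb{R}^{m \times n}$),

$$C = AB \quad\Longrightarrow\quad c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}.$$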
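The four definitions above can be tried out directly. The following is a minimal Python sketch (Python rather than the book's stylized MATLAB, so that it runs as written). It assumes a matrix is stored as a list of rows with 0-based indices; the names `transpose`, `add`, `scale`, and `matmul` are illustrative choices, not LAPACK or MATLAB routines.

```python
def transpose(A):
    """C = A^T, i.e. c[i][j] = a[j][i]."""
    return [list(col) for col in zip(*A)]


def add(A, B):
    """C = A + B, i.e. c[i][j] = a[i][j] + b[i][j]."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def scale(alpha, A):
    """C = alpha*A, i.e. c[i][j] = alpha * a[i][j]."""
    return [[alpha * a for a in row] for row in A]


def matmul(A, B):
    """C = A*B, i.e. c[i][j] = sum over k of a[i][k] * b[k][j]."""
    p, n = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(p)) for j in range(n)]
            for row in A]


A = [[1, 2], [3, 4], [5, 6]]        # 3-by-2
B = [[7, 8, 9], [10, 11, 12]]       # 2-by-3
print(transpose(A))
print(add(A, A))
print(scale(2, A))
print(matmul(A, B))                 # 3-by-3 product
```

Note that `matmul` requires the number of columns of `A` to equal the number of rows of `B`, mirroring the dimension pattern in the definition.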


These are the building blocks of matrix computations.


1.1.3 Vector Notation

Let $\mathbb{R}^n$ denote the vector space of real $n$-vectors:

$$x \in \mathbb{R}^n \quad\Longleftrightarrow\quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad x_i \in \mathbb{R}.$$

We refer to $x_i$ as the $i$th component of $x$. Depending upon context, the
alternative notations $[x]_i$ and $x(i)$ are sometimes used.

Notice that we are identifying $\mathbb{R}^n$ with $\mathbb{R}^{n \times 1}$ and so the members of
$\mathbb{R}^n$ are column vectors. On the other hand, the elements of $\mathbb{R}^{1 \times n}$ are row
vectors:

$$x \in \mathbb{R}^{1 \times n} \quad\Longleftrightarrow\quad x = (x_1, \ldots, x_n).$$

If $x$ is a column vector, then $y = x^T$ is a row vector.


1.1.4 Vector Operations 


Assume $a \in \mathbb{R}$, $x \in \mathbb{R}^n$, and $y \in \mathbb{R}^n$. Basic vector operations include
scalar-vector multiplication,

$$z = ax \quad\Longrightarrow\quad z_i = a x_i,$$

vector addition,

$$z = x + y \quad\Longrightarrow\quad z_i = x_i + y_i,$$

the dot product (or inner product),

$$c = x^T y \quad\Longrightarrow\quad c = \sum_{i=1}^{n} x_i y_i,$$

and vector multiply (or the Hadamard product)

$$z = x \,.\!*\, y \quad\Longrightarrow\quad z_i = x_i y_i.$$

Another very important operation which we write in “update form” is the
saxpy:

$$y = ax + y \quad\Longrightarrow\quad y_i = a x_i + y_i.$$

Here, the symbol “=” is being used to denote assignment, not mathematical
equality. The vector $y$ is being updated. The name “saxpy” is used in
LAPACK, a software package that implements many of the algorithms in
this book. One can think of “saxpy” as a mnemonic for “scalar $a$ $x$ plus
$y$.”




1.1.5 The Computation of Dot Products and Saxpys 


We have chosen to express algorithms in a stylized version of the MATLAB 
language. MATLAB is a powerful interactive system that is ideal for matrix 
computation work. We gradually introduce our stylized MATLAB notation 
in this chapter beginning with an algorithm for computing dot products. 


Algorithm 1.1.1 (Dot Product) If $x, y \in \mathbb{R}^n$, then this algorithm
computes their dot product $c = x^T y$.

    c = 0
    for i = 1:n
        c = c + x(i)y(i)
    end

The dot product of two n-vectors involves n multiplications and n additions. 
It is an “O(n)” operation, meaning that the amount of work is linear in 
the dimension. The saxpy computation is also an O(n) operation, but it 
returns a vector instead of a scalar. 


Algorithm 1.1.2 (Saxpy) If $x, y \in \mathbb{R}^n$ and $a \in \mathbb{R}$, then this algorithm
overwrites $y$ with $ax + y$.


    for i = 1:n
        y(i) = ax(i) + y(i)
    end


It must be stressed that the algorithms in this book are encapsulations of
critical computational ideas and not “production codes.”
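Algorithms 1.1.1 and 1.1.2 translate almost line for line into a runnable language. The sketch below uses Python with plain lists and 0-based indices (an assumption of this illustration; the book's notation is 1-based), and the function names `dot` and `saxpy` are chosen here for readability. Like the algorithms themselves, it is meant to expose the idea and is not a production code.

```python
def dot(x, y):
    """Algorithm 1.1.1: return c = x^T y."""
    c = 0
    for i in range(len(x)):
        c = c + x[i] * y[i]
    return c


def saxpy(a, x, y):
    """Algorithm 1.1.2: overwrite y with a*x + y and return it."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


x = [1, 2, 3]
y = [4, 5, 6]
print(dot(x, y))        # a scalar
print(saxpy(2, x, y))   # y is updated in place
```

The dot product returns a scalar while the saxpy modifies its vector argument, which matches the "update form" convention used in the text.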


1.1.6 Matrix-Vector Multiplication and the Gaxpy 
Suppose $A \in \mathbb{R}^{m \times n}$ and that we wish to compute the update

$$y = Ax + y$$

where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ are given. This generalized saxpy operation is
referred to as a gaxpy. A standard way that this computation proceeds is
to update the components one at a time:

$$y_i = \sum_{j=1}^{n} a_{ij} x_j + y_i, \qquad i = 1:m.$$

This gives the following algorithm.


Algorithm 1.1.3 (Gaxpy: Row Version) If $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and
$y \in \mathbb{R}^m$, then this algorithm overwrites $y$ with $Ax + y$.

    for i = 1:m
        for j = 1:n
            y(i) = A(i,j)x(j) + y(i)
        end
    end


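An alternative algorithm results if we regard $Ax$ as a linear combination of
$A$'s columns, e.g.,

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 7 + 2 \cdot 8 \\ 3 \cdot 7 + 4 \cdot 8 \\ 5 \cdot 7 + 6 \cdot 8 \end{bmatrix} = 7 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + 8 \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 23 \\ 53 \\ 83 \end{bmatrix}.$$


Algorithm 1.1.4 (Gaxpy: Column Version) If $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$,
and $y \in \mathbb{R}^m$, then this algorithm overwrites $y$ with $Ax + y$.


    for j = 1:n
        for i = 1:m
            y(i) = A(i,j)x(j) + y(i)
        end
    end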


Note that the inner loop of the column version carries out a saxpy
operation, whereas the inner loop of the row version accumulates a dot
product. The column version was derived by rethinking what matrix-
vector multiplication “means” at the vector level, but it could also have 
been obtained simply by interchanging the order of the loops in the row 
version. In matrix computations, it is important to relate loop interchanges 
to the underlying linear algebra. 
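To see that the loop interchange changes only the order of the work and not the result, the two gaxpy versions can be run side by side. This is a minimal Python sketch under the same assumptions as before (lists of rows, 0-based indices); `gaxpy_row` and `gaxpy_col` are illustrative names for Algorithms 1.1.3 and 1.1.4.

```python
def gaxpy_row(A, x, y):
    """Algorithm 1.1.3: overwrite y with A*x + y, sweeping A by rows."""
    m, n = len(A), len(x)
    for i in range(m):
        for j in range(n):
            y[i] = A[i][j] * x[j] + y[i]
    return y


def gaxpy_col(A, x, y):
    """Algorithm 1.1.4: overwrite y with A*x + y, sweeping A by columns."""
    m, n = len(A), len(x)
    for j in range(n):
        for i in range(m):
            y[i] = A[i][j] * x[j] + y[i]
    return y


A = [[1, 2], [3, 4], [5, 6]]
x = [7, 8]
print(gaxpy_row(A, x, [0, 0, 0]))
print(gaxpy_col(A, x, [0, 0, 0]))
```

Both calls reproduce the 3-by-2 example from the text; only the order in which the entries of `A` are visited differs.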


1.1.7 Partitioning a Matrix into Rows and Columns 


Algorithms 1.1.3 and 1.1.4 access the data in A by row and by column 
respectively. To highlight these orientations more clearly we introduce the 
language of partitioned matrices. 

From the row point of view, a matrix is a stack of row vectors: 


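$$A \in \mathbb{R}^{m \times n} \quad\Longleftrightarrow\quad A = \begin{bmatrix} r_1^T \\ \vdots \\ r_m^T \end{bmatrix}, \qquad r_k \in \mathbb{R}^n. \tag{1.1.1}$$


This is called a row partition of $A$. Thus, if we row partition

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix},$$

then we are choosing to think of $A$ as a collection of rows with

$$r_1^T = [\,1 \;\; 2\,], \qquad r_2^T = [\,3 \;\; 4\,], \qquad\text{and}\qquad r_3^T = [\,5 \;\; 6\,].$$


With the row partitioning (1.1.1) Algorithm 1.1.3 can be expressed as follows:


    for i = 1:m
        y(i) = r_i^T x + y(i)
    end


Alternatively, a matrix is a collection of column vectors:

$$A \in \mathbb{R}^{m \times n} \quad\Longleftrightarrow\quad A = [\,c_1, \ldots, c_n\,], \qquad c_k \in \mathbb{R}^m. \tag{1.1.2}$$


We refer to this as a column partition of $A$. In the 3-by-2 example above, we
thus would set $c_1$ and $c_2$ to be the first and second columns of $A$ respectively:

$$c_1 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}, \qquad c_2 = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}.$$


With (1.1.2) we see that Algorithm 1.1.4 is a saxpy procedure that accesses
$A$ by columns:


    for j = 1:n
        y = x_j c_j + y
    end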


In this context appreciate y as a running vector sum that undergoes re- 
peated saxpy updates. 


1.1.8 The Colon Notation 


A handy way to specify a column or row of a matrix is with the “colon”
notation. If $A \in \mathbb{R}^{m \times n}$, then $A(k,:)$ designates the $k$th row, i.e.,

$$A(k,:) = [\,a_{k1}, \ldots, a_{kn}\,].$$

The $k$th column is specified by

$$A(:,k) = \begin{bmatrix} a_{1k} \\ \vdots \\ a_{mk} \end{bmatrix}.$$

With these conventions we can rewrite Algorithms 1.1.3 and 1.1.4 as


    for i = 1:m
        y(i) = A(i,:)x + y(i)
    end

and


    for j = 1:n
        y = x(j)A(:,j) + y
    end


respectively. With the colon notation we are able to suppress iteration 
details. This frees us to think at the vector level and focus on larger com- 
putational issues. 


1.1.9 The Outer Product Update 


As a preliminary application of the colon notation, we use it to understand
the outer product update

$$A = A + xy^T, \qquad A \in \mathbb{R}^{m \times n},\; x \in \mathbb{R}^m,\; y \in \mathbb{R}^n.$$

The outer product operation $xy^T$ “looks funny” but is perfectly legal, e.g.,

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} [\,4 \;\; 5\,] = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}.$$

This is because $xy^T$ is the product of two “skinny” matrices and the number
of columns in the left matrix $x$ equals the number of rows in the right matrix
$y^T$. The entries in the outer product update are prescribed by


    for i = 1:m
        for j = 1:n
            a_ij = a_ij + x_i y_j
        end
    end


The mission of the $j$-loop is to add a multiple of $y^T$ to the $i$th row of $A$,
i.e.,


    for i = 1:m
        A(i,:) = A(i,:) + x(i)y^T
    end


On the other hand, if we make the $i$-loop the inner loop, then its task is to
add a multiple of $x$ to the $j$th column of $A$:


    for j = 1:n
        A(:,j) = A(:,j) + y(j)x
    end
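The two orderings of the outer product update can be compared in a short Python sketch. As in the earlier sketches, matrices are lists of rows with 0-based indices, and the names `outer_update_by_rows` and `outer_update_by_cols` are illustrative only.

```python
def outer_update_by_rows(A, x, y):
    """A = A + x*y^T, adding x[i]*y^T to row i of A (i-loop outermost)."""
    for i in range(len(x)):
        for j in range(len(y)):
            A[i][j] = A[i][j] + x[i] * y[j]
    return A


def outer_update_by_cols(A, x, y):
    """A = A + x*y^T, adding y[j]*x to column j of A (j-loop outermost)."""
    for j in range(len(y)):
        for i in range(len(x)):
            A[i][j] = A[i][j] + y[j] * x[i]
    return A


x = [1, 2, 3]
y = [4, 5]
zeros = lambda: [[0, 0] for _ in range(3)]   # fresh 3-by-2 zero matrix
print(outer_update_by_rows(zeros(), x, y))
print(outer_update_by_cols(zeros(), x, y))
```

Starting from a zero matrix, both versions return the outer product from the 3-by-2 example above.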


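Note that both outer product algorithms amount to a set of saxpy updates.


1.1.10 Matrix-Matrix Multiplication


Consider the 2-by-2 matrix-matrix multiplication $AB$. In the dot product
formulation each entry is computed as a dot product:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix}.$$

In the saxpy version each column in the product is regarded as a linear
combination of columns of $A$:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \left[\; 5 \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 7 \begin{bmatrix} 2 \\ 4 \end{bmatrix}, \;\; 6 \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 8 \begin{bmatrix} 2 \\ 4 \end{bmatrix} \;\right].$$


Finally, in the outer product version, the result is regarded as the sum of
outer products:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \end{bmatrix} [\,5 \;\; 6\,] + \begin{bmatrix} 2 \\ 4 \end{bmatrix} [\,7 \;\; 8\,].$$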


Although equivalent mathematically, it turns out that these versions of 
matrix multiplication can have very different levels of performance because 
of their memory traffic properties. This matter is pursued in §1.4. For now,
it is worth detailing the above three approaches to matrix multiplication 
because it gives us a chance to review notation and to practice thinking at 
different linear algebraic levels. 


1.1.11 Scalar-Level Specifications

To fix the discussion we focus on the following matrix multiplication update:

$$C = AB + C, \qquad A \in \mathbb{R}^{m \times p},\; B \in \mathbb{R}^{p \times n},\; C \in \mathbb{R}^{m \times n}.$$

The starting point is the familiar triply-nested loop algorithm:


Algorithm 1.1.5 (Matrix Multiplication: ijk Variant) If $A \in \mathbb{R}^{m \times p}$,
$B \in \mathbb{R}^{p \times n}$, and $C \in \mathbb{R}^{m \times n}$ are given, then this algorithm overwrites $C$ with
$AB + C$.


    for i = 1:m
        for j = 1:n
            for k = 1:p
                C(i,j) = A(i,k)B(k,j) + C(i,j)
            end
        end
    end

This is the “ijk variant” because we identify the rows of $C$ (and $A$) with $i$,
the columns of $C$ (and $B$) with $j$, and the summation index with $k$.

We consider the update C = AB + C instead of just C = AB for two 
reasons. We do not have to bother with C = 0 initializations and updates 
of the form C = AB + C arise more frequently in practice. 

The three loops in the matrix multiplication update can be arbitrarily 
ordered giving $3! = 6$ variations. Thus,


    for j = 1:n
        for k = 1:p
            for i = 1:m
                C(i,j) = A(i,k)B(k,j) + C(i,j)
            end
        end
    end


is the jki variant. Each of the six possibilities (ijk, jik, ikj, jki, kij, 
kji) features an inner loop operation (dot product or saxpy) and has its 
own pattern of data flow. For example, in the ijk variant, the inner loop 
oversees a dot product that requires access to a row of A and a column of 
B. The jki variant involves a saxpy that requires access to a column of C 
and a column of A. These attributes are summarized in Table 1.1.1 along 
with an interpretation of what is going on when the middle and inner loop 
are considered together. Each variant involves the same amount of floating
point arithmetic, but accesses the $A$, $B$, and $C$ data differently.


Loop   | Inner | Middle               | Inner Loop
Order  | Loop  | Loop                 | Data Access
-------+-------+----------------------+--------------------------
ijk    | dot   | vector × matrix      | A by row, B by column
jik    | dot   | matrix × vector      | A by row, B by column
ikj    | saxpy | row gaxpy            | B by row, C by row
jki    | saxpy | column gaxpy         | A by column, C by column
kij    | saxpy | row outer product    | B by row, C by row
kji    | saxpy | column outer product | A by column, C by column


TABLE 1.1.1. Matrix Multiplication: Loop Orderings and Properties
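The claim that all six loop orderings compute the same update can be checked with a small experiment. The sketch below is Python (lists of rows, 0-based indices); the function `matmul_update` and its `order` argument are hypothetical conveniences of this illustration, where the string gives the loop nesting from outermost to innermost.

```python
from itertools import permutations


def matmul_update(A, B, C, order="ijk"):
    """Overwrite C with A*B + C using the loop nesting named by `order`."""
    if sorted(order) != ["i", "j", "k"]:
        raise ValueError("order must be a permutation of 'ijk'")
    m, p, n = len(A), len(B), len(B[0])
    extent = {"i": m, "j": n, "k": p}
    outer, middle, inner = order
    for a in range(extent[outer]):
        for b in range(extent[middle]):
            for c in range(extent[inner]):
                idx = {outer: a, middle: b, inner: c}
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i][j] = A[i][k] * B[k][j] + C[i][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
for perm in permutations("ijk"):
    name = "".join(perm)
    print(name, matmul_update(A, B, [[0, 0], [0, 0]], name))
```

With integer data every ordering gives exactly the same matrix. In floating point the orderings that change the order of summation in each entry can differ by roundoff, and in a compiled language their running times can differ because of the memory access patterns listed in the table.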


1.1.12 A Dot Product Formulation 


The usual matrix multiplication procedure regards $AB$ as an array of dot
products to be computed one at a time in left-to-right, top-to-bottom order.
This is the idea behind Algorithm 1.1.5. Using the colon notation we can
highlight this dot-product formulation:


Algorithm 1.1.6 (Matrix Multiplication: Dot Product Version)
If $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$, and $C \in \mathbb{R}^{m \times n}$ are given, then this algorithm
overwrites $C$ with $AB + C$.


    for i = 1:m
        for j = 1:n
            C(i,j) = A(i,:)B(:,j) + C(i,j)
        end
    end

In the language of partitioned matrices, if

$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix}, \qquad a_k \in \mathbb{R}^p,$$

and

$$B = [\,b_1, \ldots, b_n\,], \qquad b_k \in \mathbb{R}^p,$$

then Algorithm 1.1.6 has this interpretation:

    for i = 1:m
        for j = 1:n
            c_ij = a_i^T b_j + c_ij
        end
    end


Note that the “mission” of the $j$-loop is to compute the $i$th row of the
update. To emphasize this we could write

    for i = 1:m
        c_i^T = a_i^T B + c_i^T
    end


where

$$C = \begin{bmatrix} c_1^T \\ \vdots \\ c_m^T \end{bmatrix}$$

is a row partitioning of $C$. To say the same thing with the colon notation
we write

    for i = 1:m
        C(i,:) = A(i,:)B + C(i,:)
    end


Either way we see that the inner two loops of the ijk variant define a
row-oriented gaxpy operation.


1.1.13 A Saxpy Formulation

Suppose $A$ and $C$ are column-partitioned as follows:

$$A = [\,a_1, \ldots, a_p\,], \qquad a_j \in \mathbb{R}^m,$$

$$C = [\,c_1, \ldots, c_n\,], \qquad c_j \in \mathbb{R}^m.$$

By comparing $j$th columns in $C = AB + C$ we see that

$$c_j = \sum_{k=1}^{p} b_{kj} a_k + c_j, \qquad j = 1:n.$$

These vector sums can be put together with a sequence of saxpy updates.


Algorithm 1.1.7 (Matrix Multiplication: Saxpy Version) If the
matrices $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$, and $C \in \mathbb{R}^{m \times n}$ are given, then this algorithm
overwrites $C$ with $AB + C$.


    for j = 1:n
        for k = 1:p
            C(:,j) = A(:,k)B(k,j) + C(:,j)
        end
    end


Note that the $k$-loop oversees a gaxpy operation:


    for j = 1:n
        C(:,j) = AB(:,j) + C(:,j)
    end


1.1.14 An Outer Product Formulation 
Consider the kji variant of Algorithm 1.1.5:


    for k = 1:p
        for j = 1:n
            for i = 1:m
                C(i,j) = A(i,k)B(k,j) + C(i,j)
            end
        end
    end


The inner two loops oversee the outer product update

$$C = a_k b_k^T + C$$

where

$$A = [\,a_1, \ldots, a_p\,] \qquad\text{and}\qquad B = \begin{bmatrix} b_1^T \\ \vdots \\ b_p^T \end{bmatrix} \tag{1.1.3}$$

with $a_k \in \mathbb{R}^m$ and $b_k \in \mathbb{R}^n$. We therefore obtain


Algorithm 1.1.8 (Matrix Multiplication: Outer Product Version)
If $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$, and $C \in \mathbb{R}^{m \times n}$ are given, then this algorithm
overwrites $C$ with $AB + C$.


    for k = 1:p
        C = A(:,k)B(k,:) + C
    end


This implementation revolves around the fact that AB is the sum of p outer 
products. 
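The outer product version can be sketched in Python by forming each rank-1 term explicitly and accumulating it into `C`. The helper names `outer` and `matmul_outer` are illustrative, and matrices are again assumed to be lists of rows with 0-based indices.

```python
def outer(x, y):
    """Return the outer product x*y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]


def matmul_outer(A, B, C):
    """Overwrite C with A*B + C as a sum of p outer product updates."""
    p = len(B)
    for k in range(p):
        col_k = [row[k] for row in A]   # A(:,k)
        row_k = B[k]                    # B(k,:)
        update = outer(col_k, row_k)
        for i in range(len(C)):
            for j in range(len(C[0])):
                C[i][j] = update[i][j] + C[i][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer(A, B, [[0, 0], [0, 0]]))
```

For the 2-by-2 example of §1.1.10 the loop adds the two rank-1 matrices built from the columns of `A` and the rows of `B`. Forming `update` as a separate matrix is done here only for clarity; the update could equally be accumulated in place.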


1.1.15 The Notion of "Level" 


The dot product and saxpy operations are examples of "level-1" operations. 
Level-1 operations involve an amount of data and an amount of arithmetic 
that is linear in the dimension of the operation. An m-by-n outer product 
update or gaxpy operation involves a quadratic amount of data (O(mn)) 
and a quadratic amount of work (O(mn)). They are examples of “level-2” 
operations. 

The matrix update C = AB + C is a "level-3" operation. Level-3
operations involve a quadratic amount of data and a cubic amount of work.
If A, B, and C are n-by-n matrices, then C = AB + C involves $O(n^2)$
matrix entries and $O(n^3)$ arithmetic operations.

The design of matrix algorithms that are rich in high-level linear al- 
gebra operations is a recurring theme in the book. For example, a high 
performance linear equation solver may require a level-3 organization of 
Gaussian elimination. This requires some algorithmic rethinking because 
that method is usually specified in level-1 terms, e.g., “multiply row 1 by a 
scalar and add the result to row 2." 


1.1.16 A Note on Matrix Equations 


In striving to understand matrix multiplication via outer products, we es- 
sentially established the matrix equation 


$$AB = \sum_{k=1}^{p} a_k b_k^T$$


where the $a_k$ and $b_k$ are defined by the partitionings in (1.1.3).
Numerous matrix equations are developed in subsequent chapters. Sometimes
they are established algorithmically like the above outer product expansion
and other times they are proved at the ij-component level. As
an example of the latter, we prove an important result that characterizes
transposes of products.


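Theorem 1.1.1 If $A \in \mathbb{R}^{m \times p}$ and $B \in \mathbb{R}^{p \times n}$, then $(AB)^T = B^T A^T$.

Proof. If $C = (AB)^T$, then

$$c_{ij} = [(AB)^T]_{ij} = [AB]_{ji} = \sum_{k=1}^{p} a_{jk} b_{ki}.$$

On the other hand, if $D = B^T A^T$, then

$$d_{ij} = [B^T A^T]_{ij} = \sum_{k=1}^{p} [B^T]_{ik} [A^T]_{kj} = \sum_{k=1}^{p} b_{ki} a_{jk}.$$

Since $c_{ij} = d_{ij}$ for all i and j, it follows that C = D. □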


Scalar-level proofs such as this one are usually not very insightful. However, 
they are sometimes the only way to proceed. 


1.1.17 Complex Matrices 


From time to time computations that involve complex matrices are discussed.
The vector space of m-by-n complex matrices is designated by
$\mathbb{C}^{m \times n}$. The scaling, addition, and multiplication of complex matrices
correspond exactly to the real case. However, transposition becomes conjugate
transposition:

$$C = A^H \implies c_{ij} = \bar{a}_{ji}.$$

The vector space of complex n-vectors is designated by $\mathbb{C}^n$. The dot product
of complex n-vectors x and y is prescribed by

$$s = x^H y = \sum_{i=1}^{n} \bar{x}_i y_i.$$

Finally, if $A = B + iC \in \mathbb{C}^{m \times n}$, then we designate the real and imaginary
parts of A by Re(A) = B and Im(A) = C respectively.
Problems 


P1.1.1 Suppose $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^r$ are given. Give a saxpy algorithm for computing
the first column of $M = (A - x_1 I) \cdots (A - x_r I)$.

P1.1.2 In the conventional 2-by-2 matrix multiplication C = AB, there are eight
multiplications: $a_{11}b_{11}$, $a_{11}b_{12}$, $a_{21}b_{11}$, $a_{21}b_{12}$, $a_{12}b_{21}$, $a_{12}b_{22}$, $a_{22}b_{21}$, and $a_{22}b_{22}$. Make
a table that indicates the order that these multiplications are performed for the ijk, jik,
kij, ikj, jki, and kji matrix multiply algorithms.

P1.1.3 Give an algorithm for computing $C = (xy^T)^k$ where x and y are n-vectors.

P1.1.4 Specify an algorithm for computing $(XY^T)^k$ where $X, Y \in \mathbb{R}^{n \times 2}$.

P1.1.5 Formulate an outer product algorithm for the update $C = AB^T + C$ where
$A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times n}$, and $C \in \mathbb{R}^{m \times p}$.

P1.1.6 Suppose we have real n-by-n matrices C, D, E, and F. Show how to compute
real n-by-n matrices A and B with just three real n-by-n matrix multiplications so that
(A + iB) = (C + iD)(E + iF). Hint: Compute W = (C + D)(E - F).


Notes and References for Sec. 1.1 


It must be stressed that the development of quality software from any of our
"semi-formal" algorithmic presentations is a long and arduous task. Even the
implementation of the level-1, 2, and 3 BLAS requires care:


C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh (1979). "Basic Linear
Algebra Subprograms for FORTRAN Usage," ACM Trans. Math. Soft. 5, 308-323.

C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh (1979). "Algorithm 539,
Basic Linear Algebra Subprograms for FORTRAN Usage," ACM Trans. Math. Soft.
5, 324-325.

J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson (1988). "An Extended Set
of Fortran Basic Linear Algebra Subprograms," ACM Trans. Math. Soft. 14, 1-17.

J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson (1988). "Algorithm 656 An
Extended Set of Fortran Basic Linear Algebra Subprograms: Model Implementation
and Test Programs," ACM Trans. Math. Soft. 14, 18-32.

J.J. Dongarra, J. Du Croz, I.S. Duff, and S.J. Hammarling (1990). "A Set of Level 3
Basic Linear Algebra Subprograms," ACM Trans. Math. Soft. 16, 1-17.

J.J. Dongarra, J. Du Croz, I.S. Duff, and S.J. Hammarling (1990). "Algorithm 679. A
Set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and Test
Programs," ACM Trans. Math. Soft. 16, 18-28.


Other BLAS references include


B. Kågström, P. Ling, and C. Van Loan (1991). "High-Performance Level-3 BLAS:
Sample Routines for Double Precision Real Data," in High Performance Computing
II, M. Durand and F. El Dabaghi (eds), North-Holland, 269-281.

B. Kågström, P. Ling, and C. Van Loan (1995). "GEMM-Based Level-3 BLAS: High-
Performance Model Implementations and Performance Evaluation Benchmark," in
Parallel Programming and Applications, P. Fritzson and L. Finmo (eds), IOS Press,
184-188.


For an appreciation of the subtleties associated with software development we recommend 


J.R. Rice (1981). Matrix Computations and Mathematical Software, Academic Press,
New York.


and a browse through the LAPACK manual. 
1.2 Exploiting Structure


The efficiency of a given matrix algorithm depends on many things. Most 
obvious and what we treat in this section is the amount of required arith- 
metic and storage. We continue to use matrix-vector and matrix-matrix 
multiplication as a vehicle for introducing the key ideas. As examples of 
exploitable structure we have chosen the properties of bandedness and sym- 
metry. Band matrices have many zero entries and so it is no surprise that 
band matrix manipulation allows for many arithmetic and storage short- 
cuts. Arithmetic complexity and data structures are discussed in this con- 
text. 

Symmetric matrices provide another set of examples that can be used to 
illustrate structure exploitation. Symmetric linear systems and eigenvalue 
problems have a very prominent role to play in matrix computations and 
so it is important to be familiar with their manipulation. 


1.2.1 Band Matrices and the x-0 Notation 


We say that $A \in \mathbb{R}^{m \times n}$ has lower bandwidth p if $a_{ij} = 0$ whenever i > j + p
and upper bandwidth q if j > i + q implies $a_{ij} = 0$. Here is an example of
an 8-by-5 matrix that has lower bandwidth 1 and upper bandwidth 2:


    x x x 0 0
    x x x x 0
    0 x x x x
    0 0 x x x
    0 0 0 x x
    0 0 0 0 x
    0 0 0 0 0
    0 0 0 0 0


The x's designate arbitrary nonzero entries. This notation is handy to
indicate the zero-nonzero structure of a matrix and we use it extensively.
Band structures that occur frequently are tabulated in Table 1.2.1.


1.2.2 Diagonal Matrix Manipulation 


Matrices with upper and lower bandwidth zero are diagonal. If $D \in \mathbb{R}^{m \times n}$
is diagonal, then


$$D = \mathrm{diag}(d_1, \ldots, d_q), \quad q = \min\{m, n\} \iff d_i = d_{ii}.$$


If D is diagonal and A is a matrix, then DA is a row scaling of A and AD 
is a column scaling of A. 
    Type of Matrix       Lower Bandwidth   Upper Bandwidth
    diagonal                   0                 0
    upper triangular           0               n - 1
    lower triangular         m - 1               0
    tridiagonal                1                 1
    upper bidiagonal           0                 1
    lower bidiagonal           1                 0
    upper Hessenberg           1               n - 1
    lower Hessenberg         m - 1               1


TABLE 1.2.1. Band Terminology for m-by-n Matrices


1.2.3 Triangular Matrix Multiplication 


To introduce band matrix "thinking" we look at the matrix multiplication
problem C = AB when A and B are both n-by-n and upper triangular.
The 3-by-3 case is illuminating:


$$C = \begin{bmatrix} a_{11}b_{11} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} + a_{13}b_{33} \\ 0 & a_{22}b_{22} & a_{22}b_{23} + a_{23}b_{33} \\ 0 & 0 & a_{33}b_{33} \end{bmatrix}$$


It suggests that the product is upper triangular and that its upper triangular
entries are the result of abbreviated inner products. Indeed, since
$a_{ik}b_{kj} = 0$ whenever k < i or j < k we see that


$$c_{ij} = \sum_{k=i}^{j} a_{ik} b_{kj}$$


and so we obtain:


Algorithm 1.2.1 (Triangular Matrix Multiplication) If $A, B \in \mathbb{R}^{n \times n}$
are upper triangular, then this algorithm computes C = AB.


    C = 0
    for i = 1:n
        for j = i:n
            for k = i:j
                C(i,j) = A(i,k)B(k,j) + C(i,j)
            end
        end
    end
To quantify the savings in this algorithm we need some tools for measuring
the amount of work.


1.2.4 Flops 


Obviously, upper triangular matrix multiplication involves less arithmetic
than when the matrices are full. One way to quantify this is with the notion
of a flop. A flop¹ is a floating point operation. A dot product or saxpy
operation of length n involves 2n flops because there are n multiplications
and n adds in either of these vector operations.

The gaxpy y = Ax + y where $A \in \mathbb{R}^{m \times n}$ involves 2mn flops as does an
m-by-n outer product update of the form $A = A + xy^T$.

The matrix multiply update C = AB + C where $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$,
and $C \in \mathbb{R}^{m \times n}$ involves 2mnp flops.

Flop counts are usually obtained by summing the amount of arithmetic
associated with the most deeply nested statements in an algorithm. For
matrix-matrix multiplication, this is the statement,


    C(i,j) = A(i,k)B(k,j) + C(i,j)


which involves two flops and is executed mnp times as a simple loop
accounting indicates. Hence the conclusion that general matrix multiplication
requires 2mnp flops.

Now let us investigate the amount of work involved in Algorithm 1.2.1.
Note that $c_{ij}$ (i ≤ j) requires 2(j - i + 1) flops. Using the heuristics


$$\sum_{p=1}^{q} p = \frac{q(q+1)}{2} \approx \frac{q^2}{2}$$

and

$$\sum_{p=1}^{q} p^2 = \frac{q^3}{3} + \frac{q^2}{2} + \frac{q}{6} \approx \frac{q^3}{3}$$

we find that triangular matrix multiplication requires one-sixth the number
of flops as full matrix multiplication:


$$\sum_{i=1}^{n} \sum_{j=i}^{n} 2(j - i + 1) = \sum_{i=1}^{n} \sum_{j=1}^{n-i+1} 2j \approx \sum_{i=1}^{n} \frac{2(n - i + 1)^2}{2} = \sum_{i=1}^{n} i^2 \approx \frac{n^3}{3}.$$


We throw away the low order terms since their inclusion does not contribute
to what the flop count "says." For example, an exact flop count of
Algorithm 1.2.1 reveals that precisely $n^3/3 + n^2 + 2n/3$ flops are involved. For
large n (the typical situation of interest) we see that the exact flop count
offers no insight beyond the $n^3/3$ approximation.


¹In the first edition of this book we defined a flop to be the amount of work associated
with an operation of the form $a_{ij} = a_{ij} + a_{ik}a_{kj}$, i.e., a floating point add, a floating
point multiply, and some subscripting. Thus, an "old flop" involves two "new flops." In
defining a flop to be a single floating point operation we are opting for a more precise
measure of arithmetic complexity.

Flop counting is a necessarily crude approach to the measuring of program
efficiency since it ignores subscripting, memory traffic, and the countless
other overheads associated with program execution. We must not infer
too much from a comparison of flop counts. We cannot conclude, for example,
that triangular matrix multiplication is six times faster than square
matrix multiplication. Flop counting is just a "quick and dirty" accounting
method that captures only one of the several dimensions of the efficiency
issue.
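
The exact count for Algorithm 1.2.1 can be checked empirically. The Python sketch below instruments the triangular multiply with a flop counter; the closed form n(n+1)(n+2)/3 used for comparison is simply $n^3/3 + n^2 + 2n/3$ rewritten over a common denominator. The function names are ours, and 0-based list-of-rows storage is assumed.

```python
def tri_matmul_flops(A, B):
    """Upper triangular product C = A*B, returning (C, flops performed)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(i, n):
            # Abbreviated inner product: only k = i..j can contribute.
            for k in range(i, j + 1):
                C[i][j] += A[i][k] * B[k][j]
                flops += 2          # one multiply and one add
    return C, flops


def exact_flops(n):
    """n^3/3 + n^2 + 2n/3, written so that integer arithmetic is exact."""
    return n * (n + 1) * (n + 2) // 3


for n in (4, 16, 64):
    U = [[1.0 if j >= i else 0.0 for j in range(n)] for i in range(n)]
    _, flops = tri_matmul_flops(U, U)
    print(n, flops, exact_flops(n), round(flops / (2 * n**3), 3))
```

The last printed column is the ratio to the 2n^3 flops of a full multiply; it should drift toward one-sixth as n grows, which is all the leading-order estimate claims.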


1.2.5 The Colon Notation-Again


The dot product that the k-loop performs in Algorithm 1.2.1 can be succinctly
stated if we extend the colon notation introduced in §1.1.8. Suppose
$A \in \mathbb{R}^{m \times n}$ and the integers p, q, and r satisfy 1 ≤ p ≤ q ≤ n and 1 ≤ r ≤ m.
We then define

$$A(r, p{:}q) = [\, a_{rp}, \ldots, a_{rq} \,] \in \mathbb{R}^{1 \times (q-p+1)}.$$

Likewise, if 1 ≤ p ≤ q ≤ m and 1 ≤ c ≤ n, then

$$A(p{:}q, c) = \begin{bmatrix} a_{pc} \\ \vdots \\ a_{qc} \end{bmatrix} \in \mathbb{R}^{q-p+1}.$$

With this notation we can rewrite Algorithm 1.2.1 as

    C(1:n,1:n) = 0
    for i = 1:n
        for j = i:n
            C(i,j) = A(i,i:j)B(i:j,j) + C(i,j)
        end
    end

We mention one additional feature of the colon notation. Negative increments
are allowed. Thus, if x and y are n-vectors, then $s = x^T y(n{:}-1{:}1)$
is the summation

$$s = \sum_{i=1}^{n} x_i y_{n-i+1}.$$

1.2.6 Band Storage 


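Suppose $A \in \mathbb{R}^{n \times n}$ has lower bandwidth p and upper bandwidth q and
assume that p and q are much smaller than n. Such a matrix can be stored
in a (p + q + 1)-by-n array A.band with the convention that


$$a_{ij} = A.band(i - j + q + 1, j) \tag{1.2.1}$$

for all (i, j) that fall inside the band. Thus, if


$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 & 0 & 0 \\ a_{21} & a_{22} & a_{23} & a_{24} & 0 & 0 \\ 0 & a_{32} & a_{33} & a_{34} & a_{35} & 0 \\ 0 & 0 & a_{43} & a_{44} & a_{45} & a_{46} \\ 0 & 0 & 0 & a_{54} & a_{55} & a_{56} \\ 0 & 0 & 0 & 0 & a_{65} & a_{66} \end{bmatrix}$$


then


$$A.band = \begin{bmatrix} 0 & 0 & a_{13} & a_{24} & a_{35} & a_{46} \\ 0 & a_{12} & a_{23} & a_{34} & a_{45} & a_{56} \\ a_{11} & a_{22} & a_{33} & a_{44} & a_{55} & a_{66} \\ a_{21} & a_{32} & a_{43} & a_{54} & a_{65} & 0 \end{bmatrix}$$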


Here, the "0" entries are unused. With this data structure, our column-oriented
gaxpy algorithm transforms to the following:


Algorithm 1.2.2 (Band Gaxpy) Suppose $A \in \mathbb{R}^{n \times n}$ has lower bandwidth
p and upper bandwidth q and is stored in the A.band format (1.2.1).
If $x, y \in \mathbb{R}^n$, then this algorithm overwrites y with Ax + y.


    for j = 1:n
        ytop = max(1, j - q)
        ybot = min(n, j + p)
        atop = max(1, q + 2 - j)
        abot = atop + ybot - ytop
        y(ytop:ybot) = x(j)A.band(atop:abot, j) + y(ytop:ybot)
    end


Notice that by storing A by column in A.band, we obtain a saxpy, column
access procedure. Indeed, Algorithm 1.2.2 is obtained from Algorithm 1.1.4
by recognizing that each saxpy involves a vector with a small number of
nonzeros. Integer arithmetic is used to identify the location of these nonzeros.
As a result of this careful zero/nonzero analysis, the algorithm involves
just 2n(p + q + 1) flops with the assumption that p and q are much smaller
than n.
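
To make the index bookkeeping concrete, here is a minimal Python sketch of the band data structure (1.2.1) and of Algorithm 1.2.2. Python indices are 0-based, so the storage rule becomes Aband[i - j + q][j] = A[i][j]; the helper names `to_band` and `band_gaxpy` are assumptions of this sketch rather than anything defined in the text.

```python
def to_band(A, p, q):
    """Pack an n-by-n matrix with lower/upper bandwidth p/q into (p+q+1)-by-n form."""
    n = len(A)
    Aband = [[0] * n for _ in range(p + q + 1)]
    for j in range(n):
        for i in range(max(0, j - q), min(n - 1, j + p) + 1):
            Aband[i - j + q][j] = A[i][j]    # 0-based version of (1.2.1)
    return Aband


def band_gaxpy(Aband, p, q, x, y):
    """Overwrite y with A*x + y, touching only entries inside the band."""
    n = len(x)
    for j in range(n):
        ytop = max(0, j - q)           # first row of column j inside the band
        ybot = min(n - 1, j + p)       # last row of column j inside the band
        atop = ytop - j + q            # matching row in the band array
        for r in range(ybot - ytop + 1):
            y[ytop + r] += x[j] * Aband[atop + r][j]
    return y


# 5-by-5 example with p = 1, q = 2; entry (i, j) is tagged as 10(i+1) + (j+1).
A = [[10 * (i + 1) + (j + 1) if (i - j <= 1 and j - i <= 2) else 0
      for j in range(5)] for i in range(5)]
print(band_gaxpy(to_band(A, 1, 2), 1, 2, [1, 2, 3, 4, 5], [0] * 5))
```

Each column contributes at most p + q + 1 multiply-add pairs, which is where the 2n(p + q + 1) bound comes from.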


1.2.7 Symmetry 
We say that $A \in \mathbb{R}^{n \times n}$ is symmetric if $A^T = A$. Thus,


$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}$$

is symmetric. Storage requirements can be halved if we just store the lower
triangle of elements, e.g., A.vec = [ 1 2 3 4 5 6 ]. In general, with
this data structure we agree to store the $a_{ij}$ as follows:


$$a_{ij} = A.vec((j - 1)n - j(j - 1)/2 + i) \qquad (i \ge j) \tag{1.2.2}$$


Let us look at the column-oriented gaxpy operation with the matrix A
represented in A.vec.


Algorithm 1.2.3 (Symmetric Storage Gaxpy) Suppose $A \in \mathbb{R}^{n \times n}$ is
symmetric and stored in the A.vec style (1.2.2). If $x, y \in \mathbb{R}^n$, then this
algorithm overwrites y with Ax + y.


    for j = 1:n
        for i = 1:j-1
            y(i) = A.vec((i - 1)n - i(i - 1)/2 + j)x(j) + y(i)
        end
        for i = j:n
            y(i) = A.vec((j - 1)n - j(j - 1)/2 + i)x(j) + y(i)
        end
    end


This algorithm requires the same $2n^2$ flops that an ordinary gaxpy requires.
Notice that the halving of the storage requirement is purchased with some 
awkward subscripting. 
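
The subscripting in (1.2.2) is easy to get wrong, so a small Python sketch may help. With 0-based indices the map becomes j*n - j*(j+1)/2 + i for i ≥ j; the names `pack_lower`, `vec_index`, and `sym_gaxpy` are hypothetical helpers introduced for this illustration.

```python
def pack_lower(A):
    """Store the lower triangle of a symmetric matrix column by column."""
    n = len(A)
    return [A[i][j] for j in range(n) for i in range(j, n)]


def vec_index(i, j, n):
    """0-based position of a_ij in the packed vector, using symmetry if i < j."""
    if i < j:
        i, j = j, i
    return j * n - j * (j + 1) // 2 + i


def sym_gaxpy(avec, x, y):
    """Overwrite y with A*x + y where A is symmetric and packed in avec."""
    n = len(x)
    for j in range(n):
        for i in range(n):
            y[i] += avec[vec_index(i, j, n)] * x[j]
    return y


A = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
avec = pack_lower(A)
print(avec)
print(sym_gaxpy(avec, [1, 1, 1], [0, 0, 0]))
```

Folding the two i-loops of Algorithm 1.2.3 into one call to `vec_index` trades a branch per entry for shorter code; the flop count is unchanged.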


1.2.8 Store by Diagonal 
Symmetric matrices can also be stored by diagonal. If


$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix},$$


then in a store-by-diagonal scheme we represent A with the vector

    A.diag = [ 1 4 6 2 5 3 ].

In general, if i ≥ j, then

$$a_{i+k,i} = A.diag(i + nk - k(k - 1)/2) \qquad (k \ge 0) \tag{1.2.3}$$

Some notation simplifies the discussion of how to use this data structure in
a matrix-vector multiplication.


If $A \in \mathbb{R}^{m \times n}$, then let $D(A, k) \in \mathbb{R}^{m \times n}$ designate the kth diagonal of A
as follows:


$$[D(A, k)]_{ij} = \begin{cases} a_{ij} & j = i + k, \ 1 \le i \le m, \ 1 \le j \le n \\ 0 & \text{otherwise.} \end{cases}$$




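Thus,

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix} = \underbrace{\begin{bmatrix} 0 & 0 & 3 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}}_{D(A,2)} + \underbrace{\begin{bmatrix} 0 & 2 & 0 \\ 0 & 0 & 5 \\ 0 & 0 & 0 \end{bmatrix}}_{D(A,1)} + \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 6 \end{bmatrix}}_{D(A,0)} + \underbrace{\begin{bmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 0 & 5 & 0 \end{bmatrix}}_{D(A,-1)} + \underbrace{\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 3 & 0 & 0 \end{bmatrix}}_{D(A,-2)}$$


Returning to our store-by-diagonal data structure, we see that the nonzero
parts of D(A,0), D(A,1), ..., D(A,n - 1) are sequentially stored in the
A.diag scheme (1.2.3). The gaxpy y = Ax + y can then be organized as
follows:

$$y = D(A, 0)x + \sum_{k=1}^{n-1} \left( D(A, k) + D(A, k)^T \right) x + y.$$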


Working out the details we obtain the following algorithm. 


Algorithm 1.2.4 (Store-By-Diagonal Gaxpy) Suppose $A \in \mathbb{R}^{n \times n}$ is
symmetric and stored in the A.diag style (1.2.3). If $x, y \in \mathbb{R}^n$, then this
algorithm overwrites y with Ax + y.


    for i = 1:n
        y(i) = A.diag(i)x(i) + y(i)
    end
    for k = 1:n-1
        t = nk - k(k - 1)/2
        {y = D(A,k)x + y}
        for i = 1:n-k
            y(i) = A.diag(i + t)x(i + k) + y(i)
        end
        {y = D(A,k)^T x + y}
        for i = 1:n-k
            y(i + k) = A.diag(i + t)x(i) + y(i + k)
        end
    end


Note that the inner loops oversee vector multiplications:


    y(1:n-k) = A.diag(t+1:t+n-k) .* x(k+1:n) + y(1:n-k)
    y(k+1:n) = A.diag(t+1:t+n-k) .* x(1:n-k) + y(k+1:n)
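
A Python rendering of the store-by-diagonal scheme is sketched below. It assumes 0-based lists, so diagonal k starts at offset t = nk - k(k-1)/2 exactly as in the algorithm, with the loop index shifted down by one; `pack_diag` and `diag_gaxpy` are names chosen for this sketch only.

```python
def pack_diag(A):
    """Store a symmetric matrix by diagonals: main diagonal first, then k = 1, 2, ..."""
    n = len(A)
    return [A[i + k][i] for k in range(n) for i in range(n - k)]


def diag_gaxpy(adiag, x, y):
    """Overwrite y with A*x + y where A is symmetric and stored by diagonal."""
    n = len(x)
    for i in range(n):
        y[i] += adiag[i] * x[i]                # y = D(A,0) x + y
    for k in range(1, n):
        t = n * k - k * (k - 1) // 2           # start of the kth diagonal
        for i in range(n - k):
            d = adiag[t + i]
            y[i] += d * x[i + k]               # y = D(A,k) x + y
            y[i + k] += d * x[i]               # y = D(A,k)^T x + y
    return y


A = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
print(pack_diag(A))
print(diag_gaxpy(pack_diag(A), [1, 2, 3], [0, 0, 0]))
```

The two inner loops of Algorithm 1.2.4 are fused here because they read the same stored diagonal entry; keeping them separate, as in the text, is what exposes the two vector multiplications.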
1.2.9 A Note on Overwriting and Workspaces


An undercurrent in the above discussion has been the economical use of
storage. Overwriting input data is another way to control the amount of
memory that a matrix computation requires. Consider the n-by-n matrix
multiplication problem C = AB with the proviso that the "input matrix"
B is to be overwritten by the "output matrix" C. We cannot simply
transform


    C(1:n,1:n) = 0
    for j = 1:n
        for k = 1:n
            C(:,j) = C(:,j) + A(:,k)B(k,j)
        end
    end

to

    for j = 1:n
        for k = 1:n
            B(:,j) = B(:,j) + A(:,k)B(k,j)
        end
    end


because B(:,j) is needed throughout the entire k-loop. A linear workspace
is needed to hold the jth column of the product until it is "safe" to overwrite
B(:,j):

    for j = 1:n
        w(1:n) = 0
        for k = 1:n
            w(:) = w(:) + A(:,k)B(k,j)
        end
        B(:,j) = w(:)
    end


A linear workspace overhead is usually not important in a matrix compu- 
tation that has a 2-dimensional array of the same order. 
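
The workspace idea can be sketched in Python as follows. The function name `overwrite_b_with_ab` is ours; the point of the sketch is only that a length-n buffer w is the sole extra storage, and that column j of B is not modified until all n saxpys that read it are finished.

```python
def overwrite_b_with_ab(A, B):
    """Overwrite the n-by-n matrix B with A*B using a length-n workspace."""
    n = len(A)
    for j in range(n):
        w = [0.0] * n                      # workspace for column j of the product
        for k in range(n):
            b_kj = B[k][j]                 # still the original B(k,j)
            for i in range(n):
                w[i] += A[i][k] * b_kj
        for i in range(n):
            B[i][j] = w[i]                 # now safe to overwrite B(:,j)
    return B


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(overwrite_b_with_ab(A, B))
```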


Problems 


P1.2.1 Give an algorithm that overwrites A with $A^2$ where $A \in \mathbb{R}^{n \times n}$ is (a) upper
triangular and (b) square. Strive for a minimum workspace in each case.

P1.2.2 Suppose $A \in \mathbb{R}^{n \times n}$ is upper Hessenberg and that scalars $\lambda_1, \ldots, \lambda_r$ are given.
Give a saxpy algorithm for computing the first column of $M = (A - \lambda_1 I) \cdots (A - \lambda_r I)$.

P1.2.3 Give a column saxpy algorithm for the n-by-n matrix multiplication problem
C = AB where A is upper triangular and B is lower triangular.

P1.2.4 Extend Algorithm 1.2.2 so that it can handle rectangular band matrices. Be
sure to describe the underlying data structure.

P1.2.5 $A \in \mathbb{C}^{n \times n}$ is Hermitian if $A^H = A$. If A = B + iC, then it is easy to show that
$B^T = B$ and $C^T = -C$. Suppose we represent A in an array A.herm with the property
that A.herm(i, j) houses $b_{ij}$ if i ≥ j and $c_{ij}$ if j > i. Using this data structure write a
matrix-vector multiply function that computes Re(z) and Im(z) from Re(x) and Im(x)
so that z = Ax.

P1.2.6 Suppose $X \in \mathbb{R}^{n \times p}$ and $A \in \mathbb{R}^{n \times n}$, with A symmetric and stored by diagonal.
Give an algorithm that computes $Y = X^T A X$ and stores the result by diagonal. Use
separate arrays for A and Y.

P1.2.7 Suppose $a \in \mathbb{R}^n$ is given and that $A \in \mathbb{R}^{n \times n}$ has the property that $a_{ij} =
a_{|i-j|+1}$. Give an algorithm that overwrites y with Ax + y where $x, y \in \mathbb{R}^n$ are given.

P1.2.8 Suppose $a \in \mathbb{R}^n$ is given and that $A \in \mathbb{R}^{n \times n}$ has the property that $a_{ij} =
a_{((i+j-1) \bmod n)+1}$. Give an algorithm that overwrites y with Ax + y where $x, y \in \mathbb{R}^n$
are given.

P1.2.9 Develop a compact store-by-diagonal scheme for unsymmetric band matrices
and write the corresponding gaxpy algorithm.

P1.2.10 Suppose p and q are n-vectors and that $A = (a_{ij})$ is defined by $a_{ij} = a_{ji} = p_i q_j$
for 1 ≤ i ≤ j ≤ n. How many flops are required to compute y = Ax where $x \in \mathbb{R}^n$ is
given?


Notes and References for Sec. 1.2 


Consult the LAPACK manual for a discussion about appropriate data structures when 
symmetry and/or bandedness is present. See also 


N. Madsen, G. Rodrigue, and J. Karush (1976). "Matrix Multiplication by Diagonals
on a Vector Parallel Processor," Information Processing Letters 5, 41-45.


1.3 Block Matrices and Algorithms 


Having a facility with block matrix notation is crucial in matrix computations
because it simplifies the derivation of many central algorithms. Moreover,
"block algorithms" are increasingly important in high performance
computing. By a block algorithm we essentially mean an algorithm that
is rich in matrix-matrix multiplication. Algorithms of this type turn out
to be more efficient in many computing environments than those that are
organized at a lower linear algebraic level.


1.3.1 Block Matrix Notation 


Column and row partitionings are special cases of matrix blocking. In
general we can partition both the rows and columns of an m-by-n matrix
A to obtain

$$A = \begin{bmatrix} A_{11} & \cdots & A_{1r} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qr} \end{bmatrix}$$

with block row heights $m_1, \ldots, m_q$ and block column widths $n_1, \ldots, n_r$,
where $m_1 + \cdots + m_q = m$, $n_1 + \cdots + n_r = n$, and $A_{\alpha\beta}$ designates the
$(\alpha, \beta)$ block or submatrix. With this notation, block $A_{\alpha\beta}$ has dimension
$m_\alpha$-by-$n_\beta$ and we say that $A = (A_{\alpha\beta})$ is a q-by-r block matrix.


1.3.2 Block Matrix Manipulation 


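Block matrices combine just like matrices with scalar entries as long as
certain dimension requirements are met. For example, if


$$B = \begin{bmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & & \vdots \\ B_{q1} & \cdots & B_{qr} \end{bmatrix}$$

with block row heights $m_1, \ldots, m_q$ and block column widths $n_1, \ldots, n_r$,
then we say that B is partitioned conformably with the matrix A above.
The sum C = A + B can also be regarded as a q-by-r block matrix:


$$C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix} = \begin{bmatrix} A_{11} + B_{11} & \cdots & A_{1r} + B_{1r} \\ \vdots & & \vdots \\ A_{q1} + B_{q1} & \cdots & A_{qr} + B_{qr} \end{bmatrix}.$$


The multiplication of block matrices is a little trickier. We start with a pair
of lemmas.


Lemma 1.3.1 If $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$,


$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_q \end{bmatrix}, \qquad B = [\, B_1, \ldots, B_r \,],$$

where $A_\alpha$ has $m_\alpha$ rows and $B_\beta$ has $n_\beta$ columns, then

$$AB = C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix}$$


where $C_{\alpha\beta} = A_\alpha B_\beta$ for $\alpha = 1{:}q$ and $\beta = 1{:}r$.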
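Proof. First we relate scalar entries in block $C_{\alpha\beta}$ to scalar entries in C.
For $1 \le \alpha \le q$, $1 \le \beta \le r$, $1 \le i \le m_\alpha$, and $1 \le j \le n_\beta$ we have


$$[C_{\alpha\beta}]_{ij} = c_{\lambda+i,\, \mu+j}$$


where

$$\lambda = m_1 + \cdots + m_{\alpha-1}, \qquad \mu = n_1 + \cdots + n_{\beta-1}.$$

But


$$c_{\lambda+i,\, \mu+j} = \sum_{k=1}^{p} a_{\lambda+i,k}\, b_{k,\mu+j} = \sum_{k=1}^{p} [A_\alpha]_{ik} [B_\beta]_{kj} = [A_\alpha B_\beta]_{ij}.$$


Thus, $C_{\alpha\beta} = A_\alpha B_\beta$. □


Lemma 1.3.2 If $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$,


$$A = [\, A_1, \ldots, A_s \,], \qquad B = \begin{bmatrix} B_1 \\ \vdots \\ B_s \end{bmatrix},$$

where $A_\gamma$ has $p_\gamma$ columns and $B_\gamma$ has $p_\gamma$ rows, then

$$AB = C = \sum_{\gamma=1}^{s} A_\gamma B_\gamma.$$


Proof. We set s = 2 and leave the general s case to the reader. (See
P1.3.6.) For 1 ≤ i ≤ m and 1 ≤ j ≤ n we have


$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj} = \sum_{k=1}^{p_1} a_{ik} b_{kj} + \sum_{k=p_1+1}^{p_1+p_2} a_{ik} b_{kj} = [A_1 B_1]_{ij} + [A_2 B_2]_{ij} = [A_1 B_1 + A_2 B_2]_{ij}.$$

Thus, $C = A_1 B_1 + A_2 B_2$. □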

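For general block matrix multiplication we have the following result:

Theorem 1.3.3 If


$$A = \begin{bmatrix} A_{11} & \cdots & A_{1s} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qs} \end{bmatrix}, \qquad B = \begin{bmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & & \vdots \\ B_{s1} & \cdots & B_{sr} \end{bmatrix},$$

where $A_{\alpha\gamma}$ is $m_\alpha$-by-$p_\gamma$ and $B_{\gamma\beta}$ is $p_\gamma$-by-$n_\beta$,
and we partition the product C = AB as follows,


$$C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix},$$

with $C_{\alpha\beta}$ of dimension $m_\alpha$-by-$n_\beta$, then

$$C_{\alpha\beta} = \sum_{\gamma=1}^{s} A_{\alpha\gamma} B_{\gamma\beta}, \qquad \alpha = 1{:}q, \quad \beta = 1{:}r.$$


Proof. See P1.3.7. □


A very important special case arises if we set s = 2, r = 1, and $n_1 = 1$:


$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} A_{11}x_1 + A_{12}x_2 \\ A_{21}x_1 + A_{22}x_2 \end{bmatrix}.$$


This partitioned matrix-vector product is used over and over again in subsequent
chapters.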


1.3.3 Submatrix Designation 


As with "ordinary" matrix multiplication, block matrix multiplication can
be organized in several ways. To specify the computations precisely, we
need some notation.

Suppose $A \in \mathbb{R}^{m \times n}$ and that $i = (i_1, \ldots, i_r)$ and $j = (j_1, \ldots, j_c)$ are
integer vectors with the property that

$$i_1, \ldots, i_r \in \{1, 2, \ldots, m\}$$

$$j_1, \ldots, j_c \in \{1, 2, \ldots, n\}.$$

We let A(i, j) denote the r-by-c submatrix


$$A(i, j) = \begin{bmatrix} A(i_1, j_1) & \cdots & A(i_1, j_c) \\ \vdots & & \vdots \\ A(i_r, j_1) & \cdots & A(i_r, j_c) \end{bmatrix}.$$

If the entries in the subscript vectors i and j are contiguous, then the
"colon" notation can be used to define A(i, j) in terms of the scalar entries
in A. In particular, if $1 \le i_1 \le i_2 \le m$ and $1 \le j_1 \le j_2 \le n$, then
$A(i_1{:}i_2, j_1{:}j_2)$ is the submatrix obtained by extracting rows $i_1$ through $i_2$
and columns $j_1$ through $j_2$, e.g.,


$$A(3{:}5, 1{:}2) = \begin{bmatrix} a_{31} & a_{32} \\ a_{41} & a_{42} \\ a_{51} & a_{52} \end{bmatrix}.$$

While on the subject of submatrices, recall from §1.1.8 that if i and j are
scalars, then A(i,:) designates the ith row of A and A(:,j) designates the
jth column of A.


1.3.4 Block Matrix Times Vector 


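An important situation covered by Theorem 1.3.3 is the case of a block
matrix times vector. Let us consider the details of the gaxpy y = Ax + y
where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, and

$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_q \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_q \end{bmatrix},$$

where $A_i$ has $m_i$ rows and $y_i \in \mathbb{R}^{m_i}$.
We refer to $A_i$ as the ith block row. If $m.vec = (m_1, \ldots, m_q)$ is the vector
of block row "heights", then from


$$\begin{bmatrix} y_1 \\ \vdots \\ y_q \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_q \end{bmatrix} x + \begin{bmatrix} y_1 \\ \vdots \\ y_q \end{bmatrix}$$

we obtain

    last = 0
    for i = 1:q
        first = last + 1
        last = first + m.vec(i) - 1                          (1.3.1)
        y(first:last) = A(first:last,:)x + y(first:last)
    end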

Each time through the loop an “ordinary” gaxpy is performed so Algorithms 
1.1.3 and 1.1.4 apply. 

Another way to block the gaxpy computation is to partition A and x as
follows:


$$A = [\, A_1, \ldots, A_r \,], \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_r \end{bmatrix},$$

where $A_j$ has $n_j$ columns and $x_j \in \mathbb{R}^{n_j}$.
In this case we refer to $A_j$ as the jth block column of A. If $n.vec =
(n_1, \ldots, n_r)$ is the vector of block column widths, then from


$$y = [\, A_1, \ldots, A_r \,] \begin{bmatrix} x_1 \\ \vdots \\ x_r \end{bmatrix} + y = \sum_{j=1}^{r} A_j x_j + y$$


we obtain

    last = 0
    for j = 1:r
        first = last + 1
        last = first + n.vec(j) - 1                          (1.3.2)
        y = A(:,first:last)x(first:last) + y
    end


Again, the gaxpy's performed each time through the loop can be carried 
out with Algorithm 1.1.3 or 1.1.4. 


1.3.5 Block Matrix Multiplication 


Just as ordinary, scalar-level matrix multiplication can be arranged in several
possible ways, so can the multiplication of block matrices. Different
blockings for A, B, and C can set the stage for block versions of the dot
product, saxpy, and outer product algorithms of §1.1. To illustrate this
with a minimum of subscript clutter, we assume that these three matrices
are all n-by-n and that $n = N\ell$ where N and $\ell$ are positive integers.

If $A = (A_{\alpha\beta})$, $B = (B_{\alpha\beta})$, and $C = (C_{\alpha\beta})$ are N-by-N block matrices
with $\ell$-by-$\ell$ blocks, then from Theorem 1.3.3

$$C_{\alpha\beta} = \sum_{\gamma=1}^{N} A_{\alpha\gamma} B_{\gamma\beta} + C_{\alpha\beta}, \qquad \alpha = 1{:}N, \quad \beta = 1{:}N.$$

If we organize a matrix multiplication procedure around this summation,
then we obtain a block analog of Algorithm 1.1.5:


    for α = 1:N
        i = (α-1)ℓ+1:αℓ
        for β = 1:N
            j = (β-1)ℓ+1:βℓ                                  (1.3.3)
            for γ = 1:N
                k = (γ-1)ℓ+1:γℓ
                C(i,j) = A(i,k)B(k,j) + C(i,j)
            end
        end
    end


Note that if $\ell = 1$, then $\alpha = i$, $\beta = j$, and $\gamma = k$ and we revert to Algorithm
1.1.5.

To obtain a block saxpy matrix multiply, we write C = AB + C as

$$[\, C_1, \ldots, C_N \,] = [\, A_1, \ldots, A_N \,] \begin{bmatrix} B_{11} & \cdots & B_{1N} \\ \vdots & & \vdots \\ B_{N1} & \cdots & B_{NN} \end{bmatrix} + [\, C_1, \ldots, C_N \,]$$

where $A_\alpha, C_\alpha \in \mathbb{R}^{n \times \ell}$, and $B_{\alpha\beta} \in \mathbb{R}^{\ell \times \ell}$. From this we obtain

    for β = 1:N
        j = (β-1)ℓ+1:βℓ
        for α = 1:N
            i = (α-1)ℓ+1:αℓ                                  (1.3.4)
            C(:,j) = A(:,i)B(i,j) + C(:,j)
        end
    end


This is the block version of Algorithm 1.1.7. 
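A block outer product scheme results if we work with the blockings


$$A = [\, A_1, \ldots, A_N \,], \qquad B = \begin{bmatrix} B_1^T \\ \vdots \\ B_N^T \end{bmatrix}$$

where $A_\gamma, B_\gamma \in \mathbb{R}^{n \times \ell}$. From Lemma 1.3.2 we have

$$C = \sum_{\gamma=1}^{N} A_\gamma B_\gamma^T + C$$

and so

    for γ = 1:N
        k = (γ-1)ℓ+1:γℓ
        C = A(:,k)B(k,:) + C                                 (1.3.5)
    end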


This is the block version of Algorithm 1.1.8. 
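
To see the block index ranges at work, here is a minimal Python sketch of the block dot product ordering (1.3.3). It assumes n-by-n matrices stored as lists of rows, a block size `ell` that divides n, and 0-based indices, so block α occupies indices αℓ through (α+1)ℓ - 1. The function name `block_matmul` is ours.

```python
def block_matmul(A, B, C, ell):
    """Overwrite C with A*B + C, sweeping over ell-by-ell blocks as in (1.3.3)."""
    n = len(A)
    if n % ell != 0:
        raise ValueError("block size must divide n")
    N = n // ell
    for a in range(N):
        rows = range(a * ell, (a + 1) * ell)
        for b in range(N):
            cols = range(b * ell, (b + 1) * ell)
            for g in range(N):
                inner = range(g * ell, (g + 1) * ell)
                # Block update: C(i,j) = A(i,k) B(k,j) + C(i,j)
                for i in rows:
                    for j in cols:
                        s = C[i][j]
                        for k in inner:
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

With `ell = 1` the three block loops collapse to the scalar ijk algorithm, and with `ell = n` there is a single block update. The arithmetic is the same for every block size; only the order in which entries are visited changes.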


1.3.6 Complex Matrix Multiplication 
Consider the complex matrix multiplication update

$$C_1 + iC_2 = (A_1 + iA_2)(B_1 + iB_2) + (C_1 + iC_2)$$


where all the matrices are real and $i^2 = -1$. Comparing the real and
imaginary parts we find


$$C_1 = A_1 B_1 - A_2 B_2 + C_1$$
$$C_2 = A_1 B_2 + A_2 B_1 + C_2$$


and this can be expressed as follows:


$$\begin{bmatrix} C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} A_1 & -A_2 \\ A_2 & A_1 \end{bmatrix} \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} + \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}.$$

This suggests how real matrix software might be applied to solve complex
matrix problems. The only snag is that the explicit formation of


$$\tilde{A} = \begin{bmatrix} A_1 & -A_2 \\ A_2 & A_1 \end{bmatrix}$$


requires the "double storage" of the matrices $A_1$ and $A_2$.
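
The embedding can be tried out with a few lines of Python. The sketch below forms the doubled real matrix explicitly, which is exactly the storage penalty noted above; `real_matmul` and `complex_update` are illustrative names, and matrices are assumed to be lists of rows.

```python
def real_matmul(X, Y):
    """Conventional product of real matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def complex_update(A1, A2, B1, B2, C1, C2):
    """Return (C1, C2) after C1 + i*C2 = (A1 + i*A2)(B1 + i*B2) + (C1 + i*C2)."""
    m, n = len(A1), len(B1[0])
    # Form [[A1, -A2], [A2, A1]]; A1 and A2 are each stored twice here.
    top = [A1[i] + [-a for a in A2[i]] for i in range(m)]
    bottom = [A2[i] + A1[i] for i in range(m)]
    P = real_matmul(top + bottom, B1 + B2)       # B1 stacked on top of B2
    new_C1 = [[P[i][j] + C1[i][j] for j in range(n)] for i in range(m)]
    new_C2 = [[P[m + i][j] + C2[i][j] for j in range(n)] for i in range(m)]
    return new_C1, new_C2


# (1 + 2i)(3 + 4i) + (1 + i), written with 1-by-1 real blocks
print(complex_update([[1]], [[2]], [[3]], [[4]], [[1]], [[1]]))
```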


1.3.7 A Divide and Conquer Matrix Multiplication 


We conclude this section with a completely different approach to the matrix- 
matrix multiplication problem. The starting point in the discussion is the 
2-by-2 block matrix multiplication 


$$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$


where each block is square. In the ordinary algorithm, $C_{ij} = A_{i1}B_{1j} +
A_{i2}B_{2j}$. There are 8 multiplies and 4 adds. Strassen (1969) has shown how
to compute C with just 7 multiplies and 18 adds:


    P1  = (A11 + A22)(B11 + B22)
    P2  = (A21 + A22)B11
    P3  = A11(B12 - B22)
    P4  = A22(B21 - B11)
    P5  = (A11 + A12)B22
    P6  = (A21 - A11)(B11 + B12)
    P7  = (A12 - A22)(B21 + B22)
    C11 = P1 + P4 - P5 + P7
    C12 = P3 + P5
    C21 = P2 + P4
    C22 = P1 + P3 - P2 + P6


These equations are easily confirmed by substitution. Suppose n = 2m so
that the blocks are m-by-m. Counting adds and multiplies in the computation
C = AB we find that conventional matrix multiplication involves
$(2m)^3$ multiplies and $(2m)^3 - (2m)^2$ adds. In contrast, if Strassen's algorithm
is applied with conventional multiplication at the block level, then
$7m^3$ multiplies and $7m^3 + 11m^2$ adds are required. If $m \gg 1$, then the
Strassen method involves about 7/8ths the arithmetic of the fully conventional
algorithm.

Now recognize that we can recur on the Strassen idea. In particular, we
can apply the Strassen algorithm to each of the half-sized block multiplications
associated with the $P_i$. Thus, if the original A and B are n-by-n and
$n = 2^q$, then we can repeatedly apply the Strassen multiplication algorithm.
At the bottom "level," the blocks are 1-by-1. Of course, there is no need to
recur down to the n = 1 level. When the block size gets sufficiently small,
($n \le n_{min}$), it may be sensible to use conventional matrix multiplication
when finding the $P_i$. Here is the overall procedure:


Algorithm 1.3.1 (Strassen Multiplication) Suppose $n = 2^q$ and that
$A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times n}$. If $n_{min} = 2^d$ with d ≤ q, then this algorithm
computes C = AB by applying the Strassen procedure recursively q - d times.


    function: C = strass(A, B, n, nmin)

        if n ≤ nmin
            C = AB
        else
            m = n/2; u = 1:m; v = m+1:n;
            P1 = strass(A(u,u) + A(v,v), B(u,u) + B(v,v), m, nmin)
            P2 = strass(A(v,u) + A(v,v), B(u,u), m, nmin)
            P3 = strass(A(u,u), B(u,v) - B(v,v), m, nmin)
            P4 = strass(A(v,v), B(v,u) - B(u,u), m, nmin)
            P5 = strass(A(u,u) + A(u,v), B(v,v), m, nmin)
            P6 = strass(A(v,u) - A(u,u), B(u,u) + B(u,v), m, nmin)
            P7 = strass(A(u,v) - A(v,v), B(v,u) + B(v,v), m, nmin)
            C(u,u) = P1 + P4 - P5 + P7
            C(u,v) = P3 + P5
            C(v,u) = P2 + P4
            C(v,v) = P1 + P3 - P2 + P6
        end


Unlike any of our previous algorithms strass is recursive, meaning that
it calls itself. Divide and conquer algorithms are often best described in
this manner. We have presented this algorithm in the style of a MATLAB
function so that the recursive calls can be stated with precision.
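
For readers who want to experiment, the recursion can be sketched in Python. This is a plain, unoptimized illustration assuming n is a power of 2 and matrices are lists of rows; the helper names (`mat_add`, `mat_sub`, `mat_mul`) are ours, and the dimension argument of the pseudocode is dropped because it can be read from the input.

```python
def mat_add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def mat_sub(X, Y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def mat_mul(X, Y):
    """Conventional product, used once the blocks are small enough."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def strass(A, B, nmin=1):
    """Strassen product of n-by-n matrices; n is assumed to be a power of 2."""
    n = len(A)
    if n <= nmin or n == 1:
        return mat_mul(A, B)
    m = n // 2
    # Split into m-by-m blocks (u = first half of the indices, v = second half).
    A11 = [r[:m] for r in A[:m]]; A12 = [r[m:] for r in A[:m]]
    A21 = [r[:m] for r in A[m:]]; A22 = [r[m:] for r in A[m:]]
    B11 = [r[:m] for r in B[:m]]; B12 = [r[m:] for r in B[:m]]
    B21 = [r[:m] for r in B[m:]]; B22 = [r[m:] for r in B[m:]]
    P1 = strass(mat_add(A11, A22), mat_add(B11, B22), nmin)
    P2 = strass(mat_add(A21, A22), B11, nmin)
    P3 = strass(A11, mat_sub(B12, B22), nmin)
    P4 = strass(A22, mat_sub(B21, B11), nmin)
    P5 = strass(mat_add(A11, A12), B22, nmin)
    P6 = strass(mat_sub(A21, A11), mat_add(B11, B12), nmin)
    P7 = strass(mat_sub(A12, A22), mat_add(B21, B22), nmin)
    C11 = mat_add(mat_sub(mat_add(P1, P4), P5), P7)
    C12 = mat_add(P3, P5)
    C21 = mat_add(P2, P4)
    C22 = mat_add(mat_sub(mat_add(P1, P3), P2), P6)
    # Reassemble the four blocks into one n-by-n matrix.
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

With integer entries the result matches the conventional product exactly for any `nmin`; with floating point entries the two can differ by rounding error.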

The amount of arithmetic associated with strass is a complicated function
of n and $n_{min}$. If $n_{min} \gg 1$, then it suffices to count multiplications
as the number of additions is roughly the same. If we just count the multiplications,
then it suffices to examine the deepest level of the recursion
as that is where all the multiplications occur. In strass there are q - d
subdivisions and thus, $7^{q-d}$ conventional matrix-matrix multiplications to
perform. These multiplications have size $n_{min}$ and thus strass involves
about $s = (2^d)^3\, 7^{q-d}$ multiplications compared to $c = (2^q)^3$, the number
of multiplications in the conventional approach. Notice that


$$\frac{s}{c} = \left( \frac{2^d}{2^q} \right)^3 7^{q-d} = \left( \frac{7}{8} \right)^{q-d}.$$

If d = 0, i.e., we recur on down to the 1-by-1 level, then

$$s = \left( \frac{7}{8} \right)^q c = 7^q = n^{\log_2 7} \approx n^{2.807}.$$


Thus, asymptotically, the number of multiplications in the Strassen procedure
is $O(n^{2.807})$. However, the number of additions (relative to the number
of multiplications) becomes significant as $n_{min}$ gets small.


Example 1.3.1 If n = 1024 and $n_{min} = 64$, then strass involves $(7/8)^{10-6} \approx .6$ the
arithmetic of the conventional algorithm.


Problems


P1.3.1 Generalize (1.3.3) so that it can handle the variable block-size problem covered
by Theorem 1.3.3.


P1.3.2 Generalize (1.3.4) and (1.3.5) so that they can handle the variable block-size
case.

P1.3.3 Adapt strass so that it can handle square matrix multiplication of any order.
Hint: If the "current" A has odd dimension, append a zero row and column.


P1.3.4 Prove that if


$$A = \begin{bmatrix} A_{11} & \cdots & A_{1r} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qr} \end{bmatrix}$$

is a blocking of the matrix A, then

$$A^T = \begin{bmatrix} A_{11}^T & \cdots & A_{q1}^T \\ \vdots & & \vdots \\ A_{1r}^T & \cdots & A_{qr}^T \end{bmatrix}.$$


P1.3.5 Suppose n is even and define the following function from $\mathbb{R}^n$ to $\mathbb{R}$:

$$f(x) = x(1{:}2{:}n)^T x(2{:}2{:}n) = \sum_{i=1}^{n/2} x_{2i-1} x_{2i}.$$

(a) Show that if $x, y \in \mathbb{R}^n$ then

$$x^T y = \sum_{i=1}^{n/2} (x_{2i-1} + y_{2i})(x_{2i} + y_{2i-1}) - f(x) - f(y).$$

(b) Now consider the n-by-n matrix multiplication C = AB. Give an algorithm for
computing this product that requires $n^3/2$ multiplies once f is applied to the rows of A
and the columns of B. See Winograd (1968) for details.

P1.3.6 Prove Lemma 1.3.2 for general s. Hint. Set


$$\rho_\gamma = p_1 + \cdots + p_{\gamma-1}, \qquad \gamma = 1{:}s+1$$

and show that


$$c_{ij} = \sum_{\gamma=1}^{s} \sum_{k=\rho_\gamma+1}^{\rho_{\gamma+1}} a_{ik} b_{kj}.$$

P1.3.7 Use Lemmas 1.3.1 and 1.3.2 to prove Theorem 1.3.3. In particular, set

$$A_\gamma = \begin{bmatrix} A_{1\gamma} \\ \vdots \\ A_{q\gamma} \end{bmatrix} \quad \text{and} \quad B_\gamma = [\, B_{\gamma 1}, \ldots, B_{\gamma r} \,]$$

and note from Lemma 1.3.2 that


$$C = \sum_{\gamma=1}^{s} A_\gamma B_\gamma.$$


Now analyze each $A_\gamma B_\gamma$ with the help of Lemma 1.3.1.


Notes and References for Sec. 1.3 


For quite some time fast methods for matrix multiplication have attracted a lot of
attention within computer science. See


S. Winograd (1968). “A New Algorithm for Inner Product,” IEEE Trans. Comp. C-17,
693-694.

V. Strassen (1969). "Gaussian Elimination is Not Optimal," Numer. Math. 13, 354-356.

V. Pan (1984). "How Can We Speed Up Matrix Multiplication?" SIAM Review 26,
393-416.


Many of these methods have dubious practical value. However, with the publication of


D. Bailey (1988). “Extra High Speed Matrix Multiplication on the Cray-2," SIAM J.
Sci. and Stat. Comp. 9, 603-607.


it is clear that the blanket dismissal of these fast procedures is unwise. The "stability"
of the Strassen algorithm is discussed in §2.4.10. See also


N.J. Higham (1990). “Exploiting Fast Matrix Multiplication within the Level 3 BLAS,”
ACM Trans. Math. Soft. 16, 352-368.

C.C. Douglas, M. Heroux, G. Slishman, and R.M. Smith (1994). “GEMMW: A Portable
Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm,"
J. Comput. Phys. 110, 1-10.


1.4 Vectorization and Re-Use Issues 


The matrix manipulations discussed in this book are mostly built upon 
dot products and saxpy operations. Vector pipeline computers are able 
to perform vector operations such as these very fast because of special 
hardware that is able to exploit the fact that a vector operation is a very 
regular sequence of scalar operations. Whether or not high performance 
is extracted from such a computer depends upon the length of the vector 
operands and a number of other factors that pertain to the movement of 
data such as vector stride, the number of vector loads and stores, and 
the level of data re-use. Our goal is to build a useful awareness of these
issues. We are not trying to build a comprehensive model of vector pipeline
computing that might be used to predict performance. We simply want to
identify the kind of thinking that goes into the design of an effective vector
pipeline code. We do not mention any particular machine. The literature
is filled with case studies.


1.4.1 Pipelining Arithmetic Operations 


The primary reason why vector computers are fast has to do with pipelin- 
ing. The concept of pipelining is best understood by making an analogy to 
assembly line production. Suppose the assembly of an individual automo- 
bile requires one minute at each of sixty workstations along an assembly 
line. If the line is well staffed and able to initiate the assembly of a new car 
every minute, then 1000 cars can be produced from scratch in about 1000 
+ 60 = 1060 minutes. For a work order of this size the line has an effective 
“vector speed" of 1000/1060 automobiles per minute. On the other hand, 
if the assembly line is understaffed and a new assembly can be initiated 
just once an hour, then 1000 hours are required to produce 1000 cars. In 
this case the line has an effective "scalar speed" of 1/60th automobile per 
minute. 

So it is with a pipelined vector operation such as the vector add z = x + y.
The scalar operations $z_i = x_i + y_i$ are the cars. The number of elements
is the size of the work order. If the start-to-finish time required for each
$z_i$ is $\tau$, then a pipelined, length n vector add could be completed in time
much less than $n\tau$. This gives vector speed. Without the pipelining, the
vector computation would proceed at a scalar rate and would approximately
require time $n\tau$ for completion.

Let us see how a sequence of floating point operations can be pipelined.
Floating point operations usually require several cycles to complete. For
example, a 3-cycle addition of two scalars x and y may proceed as in
FIG. 1.4.1. To visualize the operation, continue with the above metaphor
and think of the addition unit as an assembly line with three “work stations".
The input scalars x and y proceed along the assembly line spending
one cycle at each of three stations. The sum z emerges after three cycles.


FIG. 1.4.1 A 3-Cycle Adder

[Figure: an adder drawn as three stations in sequence: Adjust Exponents, Add, Normalize.]


FIG. 1.4.2 Pipelined Addition


Note that when a single, “free standing” addition is performed, only one of 
the three stations is active during the computation. 

Now consider a vector addition z = x + y. With pipelining, the x and y
vectors are streamed through the addition unit. Once the pipeline is filled
and steady state reached, a $z_i$ is produced every cycle. In FIG. 1.4.2 we
depict what the pipeline might look like once this steady state is achieved.
In this case, vector speed is about three times scalar speed because the time
for an individual add is three cycles.
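
The factor of three can be made concrete with a small cycle count. The Python sketch below is only a toy model under stated assumptions: the function names `scalar_cycles` and `pipelined_cycles` are hypothetical, one new operation is assumed to enter the pipeline per cycle, and all memory effects are ignored.

```python
def scalar_cycles(n, stages):
    """Cycles for n adds when each add must finish before the next one starts."""
    return n * stages


def pipelined_cycles(n, stages):
    """Cycles for n adds when a new add can enter the pipeline every cycle."""
    if n == 0:
        return 0
    return stages + (n - 1)   # fill the pipe once, then one result per cycle


n, stages = 1000, 3
speedup = scalar_cycles(n, stages) / pipelined_cycles(n, stages)
print(scalar_cycles(n, stages), pipelined_cycles(n, stages), round(speedup, 2))
```

With 60 stages and 1000 items this model gives 1059 time units, which agrees with the "about 1060 minutes" of the automobile analogy up to the rounding used there.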


1.4.2 Vector Operations


A vector pipeline computer comes with a repertoire of vector instructions, 
such as vector add, vector multiply, vector scale, dot product, and saxpy. 
We assume for clarity that these operations take place in vector registers. 
Vectors travel between the registers and memory by means of vector load 
and vector store instructions. 

An important attribute of a vector processor is the length of its vector
registers, which we designate by $v_L$. A length-n vector operation must be
broken down into subvector operations of length $v_L$ or less. Here is how such
a partitioning might be managed in the case of a vector addition z = x + y
where x and y are n-vectors:


first = 1
while first ≤ n
    last = min{n, first + v_L - 1}
    Vector load x(first:last).
    Vector load y(first:last).
    Vector add: z(first:last) = x(first:last) + y(first:last).
    Vector store z(first:last).
    first = last + 1
end


A reasonable compiler for a vector computer would automatically generate 
these vector instructions from a programmer specified z = x + y command. 
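
The partitioning loop can be mimicked in ordinary Python. This is a sketch under assumptions, not a model of any real vector unit: `strip_mined_add` and `v_len` are names invented for this illustration, list slices stand in for vector registers, and indices are 0-based rather than 1-based.

```python
def strip_mined_add(x, y, v_len):
    """Compute z = x + y in chunks of at most v_len elements.

    Returns the result and the number of chunks (vector instructions of each kind).
    """
    n = len(x)
    z = [0.0] * n
    first = 0
    chunks = 0
    while first < n:
        last = min(n, first + v_len)          # exclusive upper bound
        xs = x[first:last]                    # "vector load" of x
        ys = y[first:last]                    # "vector load" of y
        z[first:last] = [a + b for a, b in zip(xs, ys)]   # add and store
        first = last
        chunks += 1
    return z, chunks


z, chunks = strip_mined_add(list(range(10)), [1] * 10, v_len=4)
print(z, chunks)
```

A length-10 operation with registers of length 4 is split into three pieces of lengths 4, 4, and 2.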


1.4.3 The Vector Length Issue


Suppose the pipeline for the vector operation op takes $\tau_{op}$ cycles to "set
up." Assume that one component of the result is obtained per cycle once
the pipeline is filled. The time required to perform an n-dimensional op is
then given by

$$T_{op}(n) = (\tau_{op} + n)\mu, \qquad n \le v_L,$$

where $\mu$ is the cycle time and $v_L$ is the length of the vector hardware.

If the vectors to be combined are longer than the vector hardware length, 
then as we have seen the overall vector operation must be broken down into 
hardware-manageable chunks. Thus, if 


$$n = n_1 v_L + n_0, \qquad 0 \le n_0 < v_L,$$

then we assume that

$$T_{op}(n) = \begin{cases} n_1(\tau_{op} + v_L)\mu & n_0 = 0 \\ \left(n_1(\tau_{op} + v_L) + \tau_{op} + n_0\right)\mu & n_0 \neq 0 \end{cases}$$

specifies the overall time required to perform a length-n op. This simplifies
to

$$T_{op}(n) = \left(n + \tau_{op}\,\mathrm{ceil}(n/v_L)\right)\mu$$

where ceil($\alpha$) is the smallest integer such that $\alpha \le$ ceil($\alpha$). If $\rho$ flops per
component are involved, then the effective rate of computation for general
n is given by

$$R_{op}(n) = \frac{\rho n}{T_{op}(n)} = \frac{\rho}{\mu}\,\frac{1}{1 + \dfrac{\tau_{op}}{n}\,\mathrm{ceil}\!\left(\dfrac{n}{v_L}\right)}.$$

(If $\mu$ is in seconds, then $R_{op}$ is in flops per second.) The asymptotic rate of
performance is given by

$$\lim_{n \to \infty} R_{op}(n) = \frac{1}{\mu}\,\frac{\rho}{1 + \tau_{op}/v_L}.$$

As a way of assessing how serious the start-up overhead is for a vector
operation, Hockney and Jesshope (1988) define the quantity $n_{1/2}$ to be the
smallest n for which half of peak performance is achieved, i.e.,

$$\frac{\rho\, n_{1/2}}{T_{op}(n_{1/2})} = \frac{1}{2}\,\frac{\rho}{\mu}.$$

Machines that have big $n_{1/2}$ factors do not perform well on short vector
operations.
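
The timing model is easy to experiment with. The Python sketch below works in units of cycles (that is, it takes the cycle time to be 1). The names `t_op_cycles`, `rate`, and `n_half` are assumptions made for this illustration, and the search for the half-performance length simply scans n upward against the peak rate of one component per cycle.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def t_op_cycles(n, tau_op, v_len):
    """Cycles for a length-n op: n + tau_op * ceil(n / v_len)."""
    return n + tau_op * ceil_div(n, v_len)


def rate(n, tau_op, v_len, rho=1):
    """Flops per cycle for a length-n op with rho flops per component."""
    return rho * n / t_op_cycles(n, tau_op, v_len)


def n_half(tau_op, v_len, n_max=100_000):
    """Smallest n that reaches half of the peak rate, or None if none is found."""
    for n in range(1, n_max + 1):
        # rho*n/T >= rho/2 is equivalent to 2n >= T
        if 2 * n >= t_op_cycles(n, tau_op, v_len):
            return n
    return None


print(t_op_cycles(100, 10, 64), n_half(10, 64), n_half(100, 64))
```

In this model the half-performance length equals the start-up time when the start-up time does not exceed the register length. When the start-up time exceeds the register length, half of peak is never reached and the sketch returns None.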

Let us see what the above performance model says about the design
of the matrix multiply update C = AB + C where $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$,
and $C \in \mathbb{R}^{m \times n}$. Recall from §1.1.11 that there are six possible versions of
the conventional algorithm and they correspond to the six possible loop
orderings of

for i = 1:m
    for j = 1:n
        for k = 1:p
            C(i,j) = A(i,k)B(k,j) + C(i,j)
        end
    end
end


This is the ijk variant and its innermost loop oversees a length-p dot product.
Thus, our performance model predicts that


$$T_{ijk} = mnp + mn \cdot \mathrm{ceil}(p/v_L)\,\tau_{dot}$$


cycles are required. A similar analysis for each of the other variants leads
to the following table:


Variant | Cycles
ijk     | $mnp + mn \cdot \mathrm{ceil}(p/v_L)\,\tau_{dot}$
jik     | $mnp + mn \cdot \mathrm{ceil}(p/v_L)\,\tau_{dot}$
ikj     | $mnp + mp \cdot \mathrm{ceil}(n/v_L)\,\tau_{sax}$
jki     | $mnp + np \cdot \mathrm{ceil}(m/v_L)\,\tau_{sax}$
kij     | $mnp + mp \cdot \mathrm{ceil}(n/v_L)\,\tau_{sax}$
kji     | $mnp + np \cdot \mathrm{ceil}(m/v_L)\,\tau_{sax}$


We make a few observations based upon some elementary integer arithmetic
manipulation. Assume that $\tau_{sax}$ and $\tau_{dot}$ are roughly equal. If m, n, and
p are all less than $v_L$, then the most efficient variants will have the longest
inner loops. If m, n, and p are much bigger than $v_L$, then the distinction
between the six options is small.
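
Both observations can be checked by evaluating the cycle counts for the six loop orderings. The Python sketch below is illustrative: the function name `matmul_cycles` and the sample dimensions and start-up times are assumptions, and the formulas are those of the performance model above with one result per cycle.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def matmul_cycles(m, n, p, v_len, tau_dot, tau_sax):
    """Predicted cycles for the six loop orderings of C = AB + C."""
    base = m * n * p
    dot = base + m * n * ceil_div(p, v_len) * tau_dot       # inner loop over k
    sax_row = base + m * p * ceil_div(n, v_len) * tau_sax   # inner loop over j
    sax_col = base + n * p * ceil_div(m, v_len) * tau_sax   # inner loop over i
    return {"ijk": dot, "jik": dot, "ikj": sax_row,
            "jki": sax_col, "kij": sax_row, "kji": sax_col}


short = matmul_cycles(4, 8, 32, v_len=64, tau_dot=10, tau_sax=10)
large = matmul_cycles(4000, 8000, 32000, v_len=64, tau_dot=10, tau_sax=10)
print(short)
print(max(large.values()) / min(large.values()))
```

In the short case the variants with the longest inner loop (length p = 32) are predicted to be cheapest. In the large case the six predictions differ by well under one percent.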


1.4.4 The Stride Issue 


The “layout” of a vector operand in memory often has a bearing on execution
speed. The key factor is stride. The stride of a stored floating point
vector is the distance (in logical memory locations) between the vector's
components. Accessing a row in a two-dimensional Fortran array is not a
unit stride operation because arrays are stored by column. In C, it is just
the opposite as matrices are stored by row. Nonunit stride vector operations
may interfere with the pipelining capability of a computer, degrading
performance.

To clarify the stride issue we consider how the six variants of matrix
multiplication “pull up" data from the A, B, and C matrices in the inner
loop. This is where the vector calculation occurs (dot product or saxpy)
and there are three possibilities:

jki or kji:   for i = 1:m
                  C(i,j) = C(i,j) + A(i,k)B(k,j)
              end
ikj or kij:   for j = 1:n
                  C(i,j) = C(i,j) + A(i,k)B(k,j)
              end
ijk or jik:   for k = 1:p
                  C(i,j) = C(i,j) + A(i,k)B(k,j)
              end


Here is a table that specifies the A, B, and C strides associated with each
of these possibilities:


Variant    | A Stride | B Stride | C Stride
jki or kji | Unit     | 0        | Unit
ikj or kij | 0        | Non-Unit | Non-Unit
ijk or jik | Non-Unit | Unit     | 0


Storage in column-major order is assumed. A stride of zero means that only 
a single array element is accessed in the inner loop. From the stride point 
of view, it is clear that we should favor the jki and kji variants. This may 
not coincide with a preference that is based on vector length considerations. 
Dilemmas of this type are typical in high performance computing. One goal 
(maximize vector length) can conflict with another (impose unit stride). 
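
The stride table can be reproduced by computing how far apart consecutive inner-loop accesses lie in a column-major layout. In the Python sketch below, `colmajor_offset` and `inner_loop_strides` are hypothetical helper names, and A, B, and C are assumed to be m-by-p, p-by-n, and m-by-n with no padding between columns.

```python
def colmajor_offset(i, j, rows):
    """0-based memory offset of entry (i, j), 1-based indices, column-major storage."""
    return (j - 1) * rows + (i - 1)


def inner_loop_strides(inner, m, n, p):
    """Strides of A(i,k), B(k,j), C(i,j) when the index `inner` advances by one."""
    base = {"i": 1, "j": 1, "k": 1}
    step = dict(base)
    step[inner] += 1
    a = colmajor_offset(step["i"], step["k"], m) - colmajor_offset(base["i"], base["k"], m)
    b = colmajor_offset(step["k"], step["j"], p) - colmajor_offset(base["k"], base["j"], p)
    c = colmajor_offset(step["i"], step["j"], m) - colmajor_offset(base["i"], base["j"], m)
    return a, b, c


for inner in "ijk":
    print(inner, inner_loop_strides(inner, m=5, n=6, p=7))
```

A stride of 1 corresponds to "Unit" in the table, 0 to a single element reused throughout the inner loop, and any larger value to "Non-Unit".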
Sometimes a vector stride/vector length conflict can be resolved through
the intelligent choice of data structures. Consider the gaxpy y = Ax + y
where $A \in \mathbb{R}^{n \times n}$ is symmetric. Assume that $n \le v_L$ for simplicity. If
A is stored conventionally and Algorithm 1.1.4 is used, then the central
computation entails n unit stride saxpys, each having length n:


for j = 1:n
    y = A(:,j)x(j) + y
end


Our simple execution model tells us that

$$T_1 = n(\tau_{sax} + n)$$

cycles are required.


In §1.2.7 we introduced the lower triangular storage scheme for symmetric
matrices and obtained this version of the gaxpy:

for j = 1:n
    for i = 1:j-1
        y(i) = A.vec((i-1)n - i(i-1)/2 + j)x(j) + y(i)
    end
    for i = j:n
        y(i) = A.vec((j-1)n - j(j-1)/2 + i)x(j) + y(i)
    end
end


Notice that the first i-loop does not define a unit stride saxpy. If we assume
that a length n, nonunit stride saxpy is equivalent to n unit-length saxpys
(a worst case scenario), then this implementation involves


$$T_2 = n\left(\frac{n}{2}\,\tau_{sax} + n\right)$$


cycles.
In §1.2.8 we developed the store-by-diagonal version:


for i = 1:n
    y(i) = A.diag(i)x(i) + y(i)
end
for k = 1:n-1
    t = nk - k(k-1)/2
    {y = D(A,k)x + y}
    for i = 1:n-k
        y(i) = A.diag(i+t)x(i+k) + y(i)
    end
    {y = D(A,k)^T x + y}
    for i = 1:n-k
        y(i+k) = A.diag(i+t)x(i) + y(i+k)
    end
end


In this case both inner loops define a unit stride vector multiply (vm) and
our model of execution predicts


$$T_3 = n(2\tau_{vm} + n)$$


cycles.

The example shows how the choice of data structure can affect the stride
attributes of an algorithm. Store by diagonal seems attractive because it
represents the matrix compactly and has unit stride. However, a careful
which-is-best analysis would depend upon the values of $\tau_{sax}$ and $\tau_{vm}$, and
the precise penalties for nonunit stride computation and excess storage.
The complexity of the situation would call for careful benchmarking.
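
To make the lower triangular storage scheme concrete, here is a small Python sketch of the packed gaxpy. It is an illustration under assumptions: `pack_lower`, `packed_index`, and `sym_gaxpy_packed` are names chosen for this example, the packed vector holds the lower triangle column by column, and the 1-based index formula of the text is mapped onto 0-based Python lists.

```python
def pack_lower(A):
    """Store the lower triangle of a symmetric matrix column by column."""
    n = len(A)
    return [A[i][j] for j in range(n) for i in range(j, n)]


def packed_index(i, j, n):
    """1-based position in the packed vector of entry (i, j), 1-based with i >= j."""
    return (j - 1) * n - j * (j - 1) // 2 + i


def sym_gaxpy_packed(vec, x, y, n):
    """Return y + A x where the symmetric A is given by its packed lower triangle."""
    y = list(y)
    for j in range(1, n + 1):
        for i in range(1, j):        # a_ij with i < j is read as a_ji (nonunit stride)
            y[i - 1] += vec[packed_index(j, i, n) - 1] * x[j - 1]
        for i in range(j, n + 1):    # contiguous part of column j (unit stride)
            y[i - 1] += vec[packed_index(i, j, n) - 1] * x[j - 1]
    return y


A = [[4, 1, 2], [1, 5, 3], [2, 3, 6]]
print(sym_gaxpy_packed(pack_lower(A), [1, 2, 3], [1, 1, 1], 3))
```

The first inner loop visits packed entries that are not adjacent in memory, which is the nonunit stride access discussed above.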


1.4.5 Thinking About Data Motion


Another important attribute of a matrix algorithm concerns the actual volume
of data that has to be moved around during execution. Matrices sit
in memory but the computations that involve their entries take place in
functional units. The control of memory traffic is crucial to performance
in many computers. To continue with the factory metaphor used at the
beginning of this section: Can we keep the superfast arithmetic units busy
with enough deliveries of matrix data and can we ship the results back to
memory fast enough to avoid backlog? FIG. 1.4.3 depicts the typical situation
in an advanced uniprocessor environment.


FIG. 1.4.3 Memory Hierarchy


Details vary from machine to machine, but two "axioms" prevail:


• Each level in the hierarchy has a limited capacity and for economic
reasons this capacity is usually smaller as we ascend the hierarchy.


• There is a cost, sometimes relatively great, associated with the moving
of data between two levels in the hierarchy.


The design of an efficient matrix algorithm requires careful thinking about 
the flow of data in between the various levels of storage. The vector touch 
and data re-use issues are important in this regard. 


1.4.6 The Vector Touch Issue 


In many advanced computers, data is moved around in chunks, e.g., vectors. 
The time required to read or write a vector to memory is comparable to 
the time required to engage the vector in a dot product or saxpy. Thus, the 
number of vector touches associated with a matrix code is a very important
statistic. By a “vector touch” we mean either a vector load or store.

Let's count the number of vector touches associated with an m-by-n
outer product. Assume that $m = m_1 v_L$ and $n = n_1 v_L$ where $v_L$ is the vector
hardware length. (See §1.4.3.) In this environment, the outer product
update $A = A + xy^T$ would be arranged as follows:


for α = 1:m_1
    i = (α-1)v_L + 1:αv_L
    for β = 1:n_1
        j = (β-1)v_L + 1:βv_L
        A(i,j) = A(i,j) + x(i)y(j)^T
    end
end

Each column of the submatrix A(i,j) must be loaded, updated, and then
stored. Not forgetting to account for the vector touches associated with x
and y we see that approximately


$$\sum_{\alpha=1}^{m_1}\left(1 + \sum_{\beta=1}^{n_1}(1 + 2v_L)\right) \approx 2 m_1 n$$


vector touches are required. (Low order terms do not contribute to the
analysis.)

Now consider the gaxpy update y = Ax + y where $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and
$A \in \mathbb{R}^{m \times n}$. Breaking this computation down into segments of length $v_L$
gives


for α = 1:m_1
    i = (α-1)v_L + 1:αv_L
    for β = 1:n_1
        j = (β-1)v_L + 1:βv_L
        y(i) = y(i) + A(i,j)x(j)
    end
end


Again, each column of submatrix A(i,j) must be read but the only writing
to memory involves subvectors of y. Thus, the number of vector touches
for an m-by-n gaxpy is


$$\sum_{\alpha=1}^{m_1}\left(2 + \sum_{\beta=1}^{n_1}(1 + v_L)\right) \approx m_1 n.$$


This is half the number required by an identically-sized outer product.
Thus, if a computation can be arranged in terms of either outer products
or gaxpys, then the latter is preferable from the vector touch standpoint.
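
The two touch counts can be tallied by walking through the blocked loops and counting loads and stores. The Python sketch below is only a bookkeeping illustration. The function names are hypothetical, and it assumes, as in the text, that both dimensions are exact multiples of the vector hardware length.

```python
def outer_product_touches(m1, n1, v_len):
    """Vector loads and stores for the blocked update A = A + x y^T."""
    touches = 0
    for _ in range(m1):
        touches += 1                  # load x(i)
        for _ in range(n1):
            touches += 1              # load y(j)
            touches += 2 * v_len      # load and store each column of A(i,j)
    return touches


def gaxpy_touches(m1, n1, v_len):
    """Vector loads and stores for the blocked update y = A x + y."""
    touches = 0
    for _ in range(m1):
        touches += 2                  # load y(i) once and store it once
        for _ in range(n1):
            touches += 1              # load x(j)
            touches += v_len          # load each column of A(i,j)
    return touches


print(outer_product_touches(4, 8, 64), gaxpy_touches(4, 8, 64))
```

With m_1 = 4, n_1 = 8, and a vector length of 64 (so n = 512), the counts are close to the leading terms 2 m_1 n = 4096 and m_1 n = 2048, so the gaxpy arrangement needs about half as many touches.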


1.4.7 Blocking and Re-Use


A cache is a small high-speed memory situated in between the functional
units and main memory. See FIG. 1.4.3. Cache utilization colors performance
because it has a direct bearing upon how data flows in between the
functional units and the lower levels of memory.

To illustrate this we consider the computation of the matrix multiply
update C = AB + C where $A, B, C \in \mathbb{R}^{n \times n}$ reside in main memory.¹ All
data must pass through the cache on its way to the functional units where
the floating point computations are carried out. If the cache is small and
n is big, then the update must be broken down into smaller parts so that
the cache can “gracefully” process the flow of data.

One strategy is to block the B and C matrices,

$$B = [\, B_1, \ldots, B_N \,], \qquad C = [\, C_1, \ldots, C_N \,],$$

where each block column has $\ell$ columns and we assume that $n = \ell N$. From the expansion


$$C_\alpha = AB_\alpha + C_\alpha = \sum_{k=1}^{n} A(:,k)B_\alpha(k,:) + C_\alpha$$


we obtain the following computational framework:


for α = 1:N
    Load B_α and C_α into cache.
    for k = 1:n
        Load A(:,k) into cache and update C_α:
            C_α = A(:,k)B_α(k,:) + C_α
    end
    Store C_α in main memory.
end

Note that if M is the cache size measured in floating point words, then we
must have

$$2n\ell + n \le M. \qquad (1.4.1)$$

Let $\Gamma_1$ be the number of floating point numbers that flow (in either direction)
between cache and main memory. Note that every entry in B is loaded
into cache once, every entry in C is loaded into cache once and stored back
in main memory once, and every entry in A is loaded into cache $N = n/\ell$
times. It follows that

$$\Gamma_1 = 3n^2 + \frac{n^3}{\ell}.$$

¹ The discussion which follows would also apply if the matrices were on a disk and
needed to be brought into main memory.

In the interest of keeping data motion to a minimum, we choose $\ell$ to be as
large as possible subject to the constraint (1.4.1). We therefore set

$$\ell \approx \frac{M - n}{2n},$$

obtaining

$$\Gamma_1 \approx 3n^2 + \frac{2n^4}{M - n}.$$

(We use “≈” to emphasize the approximate nature of our analysis.) If cache
is large enough to house the entire B and C matrices with room left over
for a column of A, then $\ell = n$ and $\Gamma_1 = 4n^2$. At the other extreme, if we
can just fit three columns in cache, then $\ell = 1$ and $\Gamma_1 \approx n^3$.

Now let us regard $A = (A_{\alpha\beta})$, $B = (B_{\alpha\beta})$, and $C = (C_{\alpha\beta})$ as N-by-N
block matrices with uniform block size $\ell = n/N$. With this blocking the
computation of

$$C_{\alpha\beta} = \sum_{\gamma=1}^{N} A_{\alpha\gamma}B_{\gamma\beta} + C_{\alpha\beta}, \qquad \alpha = 1:N, \quad \beta = 1:N$$

can be arranged as follows:


for α = 1:N
    for β = 1:N
        Load C_αβ into cache.
        for γ = 1:N
            Load A_αγ and B_γβ into cache.
            C_αβ = C_αβ + A_αγ B_γβ
        end
        Store C_αβ in main memory.
    end
end


In this case the main memory/cache traffic sums to

$$\Gamma_2 = 2n^2 + \frac{2n^3}{\ell}$$

because each entry in A and B is loaded $N = n/\ell$ times and each entry
in C is loaded once and stored once. We can minimize this by choosing $\ell$
to be as large as possible subject to the constraint that three blocks fit in
cache, i.e.,

$$3\ell^2 \le M.$$

Setting $\ell \approx \sqrt{M/3}$ gives

$$\Gamma_2 \approx 2n^2 + 2n^3\sqrt{\frac{3}{M}}.$$

A manipulation shows that

$$\frac{\Gamma_1}{\Gamma_2} \approx \frac{3n^2 + \dfrac{2n^4}{M-n}}{2n^2 + 2n^3\sqrt{3/M}} \ge \frac{3 + 2\dfrac{n^2}{M}}{2 + 2\sqrt{3}\sqrt{\dfrac{n^2}{M}}}.$$

The key quantity here is $n^2/M$, the ratio of matrix size (in floating point
words) to cache size. As this ratio grows we find that

$$\frac{\Gamma_1}{\Gamma_2} \approx \frac{n}{\sqrt{3M}},$$

showing that the second blocking strategy is superior from the standpoint
of data motion to and from the cache. The fundamental conclusion to be
reached from all of this is that blocking affects data motion.
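
The comparison of the two blocking strategies can be explored numerically. The Python sketch below evaluates both traffic formulas with the largest admissible integer block size. The names `gamma1` and `gamma2` are assumptions of this example, and the block size is not required to divide n exactly, so the numbers are approximate in the same spirit as the analysis.

```python
import math


def gamma1(n, M):
    """Traffic for block-column strategy: 3n^2 + n^3/l with 2nl + n <= M."""
    ell = max(1, min(n, (M - n) // (2 * n)))
    return 3 * n**2 + n**3 / ell


def gamma2(n, M):
    """Traffic for square-block strategy: 2n^2 + 2n^3/l with 3l^2 <= M."""
    ell = max(1, min(n, math.isqrt(M // 3)))
    return 2 * n**2 + 2 * n**3 / ell


n, M = 1024, 65536
print(gamma1(n, M) / gamma2(n, M), n / math.sqrt(3 * M))
```

For n = 1024 and a cache of 65536 words the first strategy moves a little more than twice as much data as the second, in line with the estimate of n divided by the square root of 3M.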


1.4.8 Block Matrix Data Structures 


We conclude this section with a discussion about block data structures. A
programming language that supports two-dimensional arrays must have a
convention for storing such a structure in memory. For example, Fortran
stores two-dimensional arrays in column major order. This means that the
entries within a column are contiguous in memory. Thus, if 24 storage
locations are allocated for $A \in \mathbb{R}^{4 \times 6}$, then in traditional store-by-column
format the matrix entries are "lined up" in memory as depicted in FIG.
1.4.4. In other words, if $A \in \mathbb{R}^{m \times n}$ is stored in v(1:mn), then we identify
A(i,j) with v((j-1)m + i). For algorithms that access matrix data by
column this is a good arrangement since the column entries are contiguous
in memory.


FIG. 1.4.4 Store by Column (4-by-6 case)


In certain block matrix algorithms it is sometimes useful to store matrices
by blocks rather than by column. Suppose, for example, that the matrix
A above is a 2-by-3 block matrix with 2-by-2 blocks. In a store-by-column
block scheme with store-by-column within each block, the 24 entries are
arranged in memory as shown in FIG. 1.4.5. This data structure can be
attractive for block algorithms because the entries within a given block are
contiguous in memory.


FIG. 1.4.5 Store-by-Blocks (4-by-6 case with 2-by-2 Blocks)
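
The two layouts can be compared through their index maps. The Python sketch below is one possible reading of the scheme described above, with blocks ordered by column and entries ordered by column within each block. The function names `colmajor_pos` and `block_pos` are hypothetical, and all positions are 1-based to match the text.

```python
def colmajor_pos(i, j, m):
    """1-based position of A(i,j) in store-by-column format for an m-row matrix."""
    return (j - 1) * m + i


def block_pos(i, j, m, b1, b2):
    """1-based position of A(i,j) when b1-by-b2 blocks are stored one after another.

    Blocks are ordered by column; entries inside a block are ordered by column.
    Assumes b1 divides the row dimension m.
    """
    block_rows = m // b1
    alpha, ii = divmod(i - 1, b1)         # 0-based block row, row inside the block
    beta, jj = divmod(j - 1, b2)          # 0-based block column, column inside the block
    block_number = beta * block_rows + alpha
    return block_number * b1 * b2 + jj * b1 + ii + 1


order = sorted(((block_pos(i, j, 4, 2, 2), (i, j))
                for i in range(1, 5) for j in range(1, 7)))
print([entry for _, entry in order][:8])
```

For the 4-by-6 case with 2-by-2 blocks, the first eight memory locations hold the (1,1) block followed by the (2,1) block.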


Problems 


P1.4.1 Consider the matrix product D = ABC where $A \in \mathbb{R}^{n \times r}$, $B \in \mathbb{R}^{r \times n}$, and
$C \in \mathbb{R}^{n \times q}$. Assume that all the matrices are stored by column and that the time required
to execute a unit-stride saxpy operation of length k is of the form $t(k) = (L + k)\mu$ where L
is a constant and $\mu$ is the cycle time. Based on this model, when is it more economical to
compute D as D = (AB)C instead of as D = A(BC)? Assume that all matrix multiplies
are done using the jki (gaxpy) algorithm.

P1.4.2 What is the total time spent in the jki variant on the saxpy operations assuming
that all the matrices are stored by column and that the time required to execute a
unit-stride saxpy operation of length k is of the form $t(k) = (L + k)\mu$ where L is a constant
and $\mu$ is the cycle time? Specialize the algorithm so that it efficiently handles the case
when A and B are n-by-n and upper triangular. Does it follow that the triangular
implementation is six times faster as the flop count suggests?

P1.4.3 Give an algorithm for computing $C = A^T B A$ where A and B are n-by-n and
B is symmetric. Arrays should be accessed in unit stride fashion within all innermost 
loops. 

P1.4.4 Suppose $A \in \mathbb{R}^{m \times n}$ is stored by column in A.col(1:mn). Assume that $m = \ell_1 M$
and $n = \ell_2 N$ and that we regard A as an M-by-N block matrix with $\ell_1$-by-$\ell_2$ blocks.
Given i, j, $\alpha$, and $\beta$ that satisfy $1 \le i \le \ell_1$, $1 \le j \le \ell_2$, $1 \le \alpha \le M$, and $1 \le \beta \le N$
determine k so that A.col(k) houses the (i,j) entry of $A_{\alpha\beta}$. Give an algorithm that
overwrites A.col with A stored by block as in Figure 1.4.5. How big of a work array is
required?


Notes and References for Sec. 1.4 
Two excellent expositions about vector computation are 


J.J. Dongarra, F.G. Gustavson, and A. Karp (1984). “Implementing Linear Algebra
Algorithms for Dense Matrices on a Vector Pipeline Machine,” SIAM Review 26,
91-112.

J.M. Ortega and R.G. Voigt (1985). “Solution of Partial Differential Equations on Vector
and Parallel Computers,” SIAM Review 27, 149-240.


A very detailed look at matrix computations in hierarchical memory systems can be
found in


K. Gallivan, W. Jalby, U. Meier, and A.H. Sameh (1988). “Impact of Hierarchical Memory
Systems on Linear Algebra Algorithm Design," Int'l J. Supercomputer Applic.
2, 12-48.


See also


W. Schönauer (1987). Scientific Computing on Vector Computers, North Holland,
Amsterdam.

R.W. Hockney and C.R. Jesshope (1988). Parallel Computers 2, Adam Hilger, Bristol
and Philadelphia.

where various models of vector processor performance are set forth. Papers on the
practical aspects of vector computing include


J.J. Dongarra and A. Hinds (1979). “Unrolling Loops in Fortran," Software Practice
and Experience 9, 219-229.

J.J. Dongarra and S. Eisenstat (1984). “Squeezing the Most Out of an Algorithm in
Cray Fortran,” ACM Trans. Math. Soft. 10, 221-230.

B.L. Buzbee (1986) “A Strategy for Vectorization,” Parallel Computing 3, 187-192.

K. Gallivan, W. Jalby, and U. Meier (1987). “The Use of BLAS3 in Linear Algebra on a
Parallel Processor with a Hierarchical Memory,” SIAM J. Sci. and Stat. Comp. 8,
1079-1084.

J.J. Dongarra and D. Walker (1995). “Software Libraries for Linear Algebra Computations
on High Performance Computers,” SIAM Review 37, 151-180.


Chapter 2 


Matrix Analysis 


§2.1 Basic Ideas from Linear Algebra

§2.2 Vector Norms

§2.3 Matrix Norms 

§2.4 Finite Precision Matrix Computations 
§2.5 Orthogonality and the SVD 

§2.6 Projections and the CS Decomposition 
§2.7 The Sensitivity of Square Linear Systems 


The analysis and derivation of algorithms in the matrix computation 
area requires a facility with certain aspects of linear algebra. Some of the 
basics are reviewed in §2.1. Norms and their manipulation are covered in 
§2.2 and §2.3. In §2.4 we develop a model of finite precision arithmetic and 
then use it in a typical roundoff analysis. 

The next two sections deal with orthogonality, which has a prominent 
role to play in matrix computations. The singular value decomposition 
and the CS decomposition are a pair of orthogonal reductions that provide 
critical insight into the important notions of rank and distance between 
subspaces. In §2.7 we examine how the solution of a linear system Ax =
b changes if A and b are perturbed. The important concept of matrix
condition is introduced.


Before You Begin

References that complement this chapter include Forsythe and Moler
(1967), Stewart (1973), Stewart and Sun (1990), and Higham (1996).


2.1 Basic Ideas from Linear Algebra


This section is a quick review of linear algebra. Readers who wish a more
detailed coverage should consult the references at the end of the section.




2.1.1 Independence, Subspace, Basis, and Dimension 


A set of vectors $\{a_1, \ldots, a_n\}$ in $\mathbb{R}^m$ is linearly independent if $\sum_{j=1}^{n} \alpha_j a_j = 0$
implies $\alpha(1:n) = 0$. Otherwise, a nontrivial combination of the $a_i$ is zero
and $\{a_1, \ldots, a_n\}$ is said to be linearly dependent.

A subspace of $\mathbb{R}^m$ is a subset that is also a vector space. Given a
collection of vectors $a_1, \ldots, a_n \in \mathbb{R}^m$, the set of all linear combinations of
these vectors is a subspace referred to as the span of $\{a_1, \ldots, a_n\}$:


$$\mathrm{span}\{a_1, \ldots, a_n\} = \left\{ \sum_{j=1}^{n} \beta_j a_j : \beta_j \in \mathbb{R} \right\}.$$

If $\{a_1, \ldots, a_n\}$ is independent and $b \in \mathrm{span}\{a_1, \ldots, a_n\}$, then b is a unique
linear combination of the $a_j$.

If $S_1, \ldots, S_k$ are subspaces of $\mathbb{R}^m$, then their sum is the subspace defined
by $S = \{ a_1 + a_2 + \cdots + a_k : a_i \in S_i, \; i = 1:k \}$. S is said to be a direct sum
if each $v \in S$ has a unique representation $v = a_1 + \cdots + a_k$ with $a_i \in S_i$.
In this case we write $S = S_1 \oplus \cdots \oplus S_k$. The intersection of the $S_i$ is also
a subspace, $S = S_1 \cap S_2 \cap \cdots \cap S_k$.

The subset $\{a_{i_1}, \ldots, a_{i_k}\}$ is a maximal linearly independent subset of
$\{a_1, \ldots, a_n\}$ if it is linearly independent and is not properly contained in any
linearly independent subset of $\{a_1, \ldots, a_n\}$. If $\{a_{i_1}, \ldots, a_{i_k}\}$ is maximal,
then $\mathrm{span}\{a_1, \ldots, a_n\} = \mathrm{span}\{a_{i_1}, \ldots, a_{i_k}\}$ and $\{a_{i_1}, \ldots, a_{i_k}\}$ is a basis
for $\mathrm{span}\{a_1, \ldots, a_n\}$. If $S \subseteq \mathbb{R}^m$ is a subspace, then it is possible to find
independent basic vectors $a_1, \ldots, a_k \in S$ such that $S = \mathrm{span}\{a_1, \ldots, a_k\}$.
All bases for a subspace S have the same number of elements. This number
is the dimension and is denoted by dim(S).


2.1.2 Range, Null Space, and Rank


There are two important subspaces associated with an m-by-n matrix A.
The range of A is defined by

$$\mathrm{ran}(A) = \{ y \in \mathbb{R}^m : y = Ax \text{ for some } x \in \mathbb{R}^n \},$$

and the null space of A is defined by

$$\mathrm{null}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$

If $A = [\, a_1, \ldots, a_n \,]$ is a column partitioning, then

$$\mathrm{ran}(A) = \mathrm{span}\{a_1, \ldots, a_n\}.$$

The rank of a matrix A is defined by

$$\mathrm{rank}(A) = \dim(\mathrm{ran}(A)).$$

It can be shown that rank(A) = rank($A^T$). We say that $A \in \mathbb{R}^{m \times n}$ is rank
deficient if rank(A) < min{m, n}. If $A \in \mathbb{R}^{m \times n}$, then

$$\dim(\mathrm{null}(A)) + \mathrm{rank}(A) = n.$$


2.1.3 Matrix Inverse


The n-by-n identity matrix $I_n$ is defined by the column partitioning

$$I_n = [\, e_1, \ldots, e_n \,]$$

where $e_k$ is the kth "canonical" vector:

$$e_k = (\underbrace{0, \ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{n-k})^T.$$

The canonical vectors arise frequently in matrix analysis and if their dimension
is ever ambiguous, we use superscripts, i.e., $e_k^{(n)} \in \mathbb{R}^n$.

If A and X are in $\mathbb{R}^{n \times n}$ and satisfy AX = I, then X is the inverse of
A and is denoted by $A^{-1}$. If $A^{-1}$ exists, then A is said to be nonsingular.
Otherwise, we say A is singular.

Several matrix inverse properties have an important role to play in matrix
computations. The inverse of a product is the reverse product of the
inverses:

$$(AB)^{-1} = B^{-1}A^{-1}. \qquad (2.1.1)$$

The transpose of the inverse is the inverse of the transpose:

$$(A^{-1})^T = (A^T)^{-1} \equiv A^{-T}. \qquad (2.1.2)$$

The identity

$$B^{-1} = A^{-1} - B^{-1}(B - A)A^{-1} \qquad (2.1.3)$$

shows how the inverse changes if the matrix changes.
The Sherman-Morrison-Woodbury formula gives a convenient expression
for the inverse of $(A + UV^T)$ where $A \in \mathbb{R}^{n \times n}$ and U and V are n-by-k:

$$(A + UV^T)^{-1} = A^{-1} - A^{-1}U(I + V^T A^{-1} U)^{-1} V^T A^{-1}. \qquad (2.1.4)$$

A rank k correction to a matrix results in a rank k correction of the inverse.
In (2.1.4) we assume that both A and $(I + V^T A^{-1} U)$ are nonsingular.

Any of these facts can be verified by just showing that the “proposed”
inverse does the job. For example, here is how to confirm (2.1.3):

$$B\left(A^{-1} - B^{-1}(B - A)A^{-1}\right) = BA^{-1} - (B - A)A^{-1} = I.$$
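
The Sherman-Morrison-Woodbury formula can also be checked on a small example. The Python sketch below uses exact rational arithmetic for a 2-by-2 matrix with a rank-one correction (k = 1). The matrices and the helper names `matmul` and `inv2` are chosen only for this illustration.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inv2(A):
    """Inverse of a nonsingular 2-by-2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[F(4), F(1)], [F(2), F(3)]]
U = [[F(1)], [F(2)]]            # n-by-k with n = 2, k = 1
V = [[F(3)], [F(1)]]
Vt = [[V[0][0], V[1][0]]]       # V^T, 1-by-2

Ainv = inv2(A)
s = 1 + matmul(matmul(Vt, Ainv), U)[0][0]          # I + V^T A^{-1} U, a scalar here
corr = matmul(matmul(Ainv, U), matmul(Vt, Ainv))   # A^{-1} U V^T A^{-1}
smw = [[Ainv[i][j] - corr[i][j] / s for j in range(2)] for i in range(2)]

UVt = matmul(U, Vt)
direct = inv2([[A[i][j] + UVt[i][j] for j in range(2)] for i in range(2)])
print(smw == direct)
```

Because the arithmetic is exact, the inverse built from the formula agrees entry by entry with the inverse computed directly from the updated matrix.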


2.1.4 The Determinant 


If $A = (a) \in \mathbb{R}^{1 \times 1}$, then its determinant is given by det(A) = a. The
determinant of $A \in \mathbb{R}^{n \times n}$ is defined in terms of order n - 1 determinants:

$$\det(A) = \sum_{j=1}^{n} (-1)^{j+1} a_{1j} \det(A_{1j}).$$

Here, $A_{1j}$ is an (n - 1)-by-(n - 1) matrix obtained by deleting the first row
and jth column of A. Useful properties of the determinant include

$$\begin{array}{ll}
\det(AB) = \det(A)\det(B) & A, B \in \mathbb{R}^{n \times n} \\
\det(A^T) = \det(A) & A \in \mathbb{R}^{n \times n} \\
\det(cA) = c^n \det(A) & c \in \mathbb{R},\; A \in \mathbb{R}^{n \times n} \\
\det(A) \neq 0 \;\Leftrightarrow\; A \text{ is nonsingular} & A \in \mathbb{R}^{n \times n}
\end{array}$$
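
The recursive definition translates directly into code. The Python sketch below expands along the first row exactly as in the definition. It is meant to illustrate the formula on small matrices, since the cost of this recursion grows factorially with n, and the function name `det` is simply a choice made for this example.

```python
def det(A):
    """Determinant by cofactor expansion along the first row (A is a list of rows)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # A_{1j}: delete the first row and the jth column (0-based j here)
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total


A = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
A_scaled = [[2 * a for a in row] for row in A]
print(det(A), det(A_scaled))
```

Scaling a 3-by-3 matrix by c = 2 multiplies the determinant by 2^3 = 8, in agreement with the third property above.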


2.1.5 Differentiation 


Suppose $\alpha$ is a scalar and that $A(\alpha)$ is an m-by-n matrix with entries $a_{ij}(\alpha)$.
If $a_{ij}(\alpha)$ is a differentiable function of $\alpha$ for all i and j, then by $\dot{A}(\alpha)$ we
mean the matrix


$$\dot{A}(\alpha) = \frac{d}{d\alpha} A(\alpha) = \left( \frac{d}{d\alpha} a_{ij}(\alpha) \right) = (\dot{a}_{ij}(\alpha)).$$


The differentiation of a parameterized matrix turns out to be a handy way
to examine the sensitivity of various matrix problems.


Problems 


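P2.1.1 Show that if $A \in \mathbb{R}^{m \times n}$ has rank p, then there exists an $X \in \mathbb{R}^{m \times p}$ and a
$Y \in \mathbb{R}^{n \times p}$ such that $A = XY^T$, where rank(X) = rank(Y) = p.

P2.1.2 Suppose $A(\alpha) \in \mathbb{R}^{m \times r}$ and $B(\alpha) \in \mathbb{R}^{r \times n}$ are matrices whose entries are
differentiable functions of the scalar $\alpha$. Show

$$\frac{d}{d\alpha}\left[A(\alpha)B(\alpha)\right] = \left[\frac{d}{d\alpha}A(\alpha)\right]B(\alpha) + A(\alpha)\left[\frac{d}{d\alpha}B(\alpha)\right].$$

P2.1.3 Suppose $A(\alpha) \in \mathbb{R}^{n \times n}$ has entries that are differentiable functions of the scalar
$\alpha$. Assuming $A(\alpha)$ is always nonsingular, show

$$\frac{d}{d\alpha}\left[A(\alpha)^{-1}\right] = -A(\alpha)^{-1}\left[\frac{d}{d\alpha}A(\alpha)\right]A(\alpha)^{-1}.$$

P2.1.4 Suppose $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$ and that $\phi(x) = x^T A x - x^T b$. Show that the
gradient of $\phi$ is given by $\nabla\phi(x) = (A^T + A)x - b$.

P2.1.5 Assume that both A and $A + uv^T$ are nonsingular where $A \in \mathbb{R}^{n \times n}$ and $u, v \in \mathbb{R}^n$.
Show that if x solves $(A + uv^T)x = b$, then it also solves a perturbed right hand side
problem of the form $Ax = b + \alpha u$. Give an expression for $\alpha$ in terms of A, u, and v.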


Notes and References for Sec. 2.1 


There are many introductory linear algebra texts. Among them, the following are par- 
ticularly useful: 


P.R. Halmos (1958). Finite Dimensional Vector Spaces, 2nd ed., Van Nostrand-Reinhold,
Princeton.

S.J. Leon (1980). Linear Algebra with Applications. Macmillan, New York.

G. Strang (1993). Introduction to Linear Algebra, Wellesley-Cambridge Press, Wellesley.

D. Lay (1994). Linear Algebra and Its Applications, Addison-Wesley, Reading, MA.

C. Meyer (1997). A Course in Applied Linear Algebra, SIAM Publications, Philadelphia,
PA.


More advanced treatments include Gantmacher (1959), Horn and Johnson (1985, 1991),
and


A.S. Householder (1964). The Theory of Matrices in Numerical Analysis, Ginn (Blaisdell),
Boston.

M. Marcus and H. Minc (1964). A Survey of Matrix Theory and Matrix Inequalities,
Allyn and Bacon, Boston.

J.N. Franklin (1968). Matrix Theory, Prentice Hall, Englewood Cliffs, NJ.

R. Bellman (1970). Introduction to Matrix Analysis, Second Edition, McGraw-Hill, New
York.

P. Lancaster and M. Tismenetsky (1985). The Theory of Matrices, Second Edition,
Academic Press, New York.

J.M. Ortega (1987). Matrix Theory: A Second Course, Plenum Press, New York.


2.2 Vector Norms 


Norms serve the same purpose on vector spaces that absolute value does
on the real line: they furnish a measure of distance. More precisely, $\mathbb{R}^n$
together with a norm on $\mathbb{R}^n$ defines a metric space. Therefore, we have the
familiar notions of neighborhood, open sets, convergence, and continuity
when working with vectors and vector-valued functions.


2.2.1 Definitions 


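A vector norm on $\mathbb{R}^n$ is a function $f: \mathbb{R}^n \to \mathbb{R}$ that satisfies the following
properties:


    $f(x) \ge 0$                     $x \in \mathbb{R}^n$,   ($f(x) = 0$ iff $x = 0$)
    $f(x + y) \le f(x) + f(y)$       $x, y \in \mathbb{R}^n$
    $f(\alpha x) = |\alpha| f(x)$    $\alpha \in \mathbb{R}$, $x \in \mathbb{R}^n$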


We denote such a function with a double bar notation: $f(x) = \|x\|$. Subscripts
on the double bar are used to distinguish between various norms.
A useful class of vector norms is the family of p-norms defined by


    $\|x\|_p = (|x_1|^p + \cdots + |x_n|^p)^{1/p}, \qquad p \ge 1.$    (2.2.1)

Of these the 1, 2, and $\infty$ norms are the most important:


    $\|x\|_1 = |x_1| + \cdots + |x_n|$
    $\|x\|_2 = (|x_1|^2 + \cdots + |x_n|^2)^{1/2} = (x^T x)^{1/2}$
    $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$

A unit vector with respect to the norm $\|\cdot\|$ is a vector $x$ that satisfies
$\|x\| = 1$.


2.2.2 Some Vector Norm Properties 
A classic result concerning p-norms is the Hölder inequality:


    $|x^T y| \le \|x\|_p \|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1.$    (2.2.2)

A very important special case of this is the Cauchy-Schwarz inequality:

    $|x^T y| \le \|x\|_2 \|y\|_2.$    (2.2.3)

All norms on $\mathbb{R}^n$ are equivalent, i.e., if $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are norms on
$\mathbb{R}^n$, then there exist positive constants $c_1$ and $c_2$ such that

    $c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha$    (2.2.4)


for all $x \in \mathbb{R}^n$. For example, if $x \in \mathbb{R}^n$, then


    $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$              (2.2.5)
    $\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$    (2.2.6)
    $\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty$           (2.2.7)
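
The definitions and inequalities above are easy to spot-check numerically. The following is a minimal sketch in plain Python; the helper name norm_p and the test vectors are our own illustrative choices, and a numerical check on one pair of vectors is of course not a proof.

```python
import math

def norm_p(x, p):
    """Vector p-norm (2.2.1); p may be math.inf for the max norm."""
    if p == math.inf:
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x = [3.0, -4.0, 0.0, 1.0]
y = [1.0, 2.0, -2.0, 0.5]
n = len(x)
n1, n2, ninf = norm_p(x, 1), norm_p(x, 2), norm_p(x, math.inf)

# Hoelder (2.2.2) with p = 3, q = 3/2, and Cauchy-Schwarz (2.2.3).
dot = abs(sum(a * b for a, b in zip(x, y)))
holder_ok = dot <= norm_p(x, 3) * norm_p(y, 1.5)
cs_ok = dot <= n2 * norm_p(y, 2)

# Equivalence bounds (2.2.5)-(2.2.7).
equiv_ok = (n2 <= n1 <= math.sqrt(n) * n2
            and ninf <= n2 <= math.sqrt(n) * ninf
            and ninf <= n1 <= n * ninf)
print(n1, n2, ninf, holder_ok, cs_ok, equiv_ok)
```

Note that the constants in (2.2.5)-(2.2.7) grow with n, so two norms that are "equivalent" can still differ substantially in high dimension.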


2.2.3 Absolute and Relative Error 


Suppose $\hat{x} \in \mathbb{R}^n$ is an approximation to $x \in \mathbb{R}^n$. For a given vector norm
$\|\cdot\|$ we say that

    $\epsilon_{abs} = \|\hat{x} - x\|$

is the absolute error in $\hat{x}$. If $x \ne 0$, then


    $\epsilon_{rel} = \frac{\|\hat{x} - x\|}{\|x\|}$


prescribes the relative error in $\hat{x}$. Relative error in the $\infty$-norm can be
translated into a statement about the number of correct significant digits
in $\hat{x}$. In particular, if

    $\frac{\|\hat{x} - x\|_\infty}{\|x\|_\infty} \approx 10^{-p},$

then the largest component of $\hat{x}$ has approximately p correct significant
digits.
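
As a small illustration of this rule of thumb, the sketch below computes the relative error in the infinity norm for made-up data (the vectors and the helper name rel_err_inf are ours, not the text's). The large component is accurate to about three digits, while the small component is badly wrong, yet the norm-wise relative error only reflects the former.

```python
import math

def rel_err_inf(xhat, x):
    """Relative error of xhat as an approximation to x in the infinity norm."""
    num = max(abs(a - b) for a, b in zip(xhat, x))
    den = max(abs(b) for b in x)
    return num / den

x = [100.0, 0.5]
xhat = [100.1, 0.4]          # second component has no correct digits
e_rel = rel_err_inf(xhat, x)
digits = -math.log10(e_rel)  # rough count of correct digits in the largest entry
print(e_rel, digits)
```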


Example 2.2.1 If $x = (1.234 \;\; .05674)^T$ and $\hat{x} = (1.235 \;\; .05128)^T$, then
$\|\hat{x} - x\|_\infty / \|x\|_\infty \approx .0043 \approx 10^{-3}$. Note that $\hat{x}_1$ has about three significant
digits that are correct while only one significant digit in $\hat{x}_2$ is correct.


2.2.4 Convergence

We say that a sequence $\{x^{(k)}\}$ of n-vectors converges to $x$ if


    $\lim_{k \to \infty} \|x^{(k)} - x\| = 0.$


Note that because of (2.2.4), convergence in the $\alpha$-norm implies convergence
in the $\beta$-norm and vice versa.


Problems 


P2.2.1 Show that if $x \in \mathbb{R}^n$, then $\lim_{p \to \infty} \|x\|_p = \|x\|_\infty$.

P2.2.2 Prove the Cauchy-Schwarz inequality (2.2.3) by considering the inequality
$0 \le (ax + by)^T (ax + by)$ for suitable scalars $a$ and $b$.

P2.2.3 Verify that $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ are vector norms.

P2.2.4 Verify (2.2.5)-(2.2.7). When is equality achieved in each result?

P2.2.5 Show that in $\mathbb{R}^n$, $x^{(i)} \to x$ if and only if $x_k^{(i)} \to x_k$ for $k = 1{:}n$.

P2.2.6 Show that any vector norm on $\mathbb{R}^n$ is uniformly continuous by verifying the
inequality $|\, \|x\| - \|y\| \,| \le \|x - y\|$.

P2.2.7 Let $\|\cdot\|$ be a vector norm on $\mathbb{R}^m$ and assume $A \in \mathbb{R}^{m \times n}$. Show that if
rank(A) = n, then $\|x\|_A = \|Ax\|$ is a vector norm on $\mathbb{R}^n$.

P2.2.8 Let $x$ and $y$ be in $\mathbb{R}^n$ and define $\psi: \mathbb{R} \to \mathbb{R}$ by $\psi(\alpha) = \|x - \alpha y\|_2$. Show that
$\psi$ is minimized when $\alpha = x^T y / y^T y$.

P2.2.9 (a) Verify that $\|x\|_p = (|x_1|^p + \cdots + |x_n|^p)^{1/p}$ is a vector norm on $\mathbb{C}^n$. (b) Show
that if $x \in \mathbb{C}^n$ then $\|x\|_p \le c\,(\|\mathrm{Re}(x)\|_p + \|\mathrm{Im}(x)\|_p)$. (c) Find a constant $c_n$ such
that $c_n (\|\mathrm{Re}(x)\|_2 + \|\mathrm{Im}(x)\|_2) \le \|x\|_2$ for all $x \in \mathbb{C}^n$.

P2.2.10 Prove or disprove:


    $v \in \mathbb{R}^n \;\Rightarrow\; \|v\|_1 \|v\|_\infty \le \frac{1 + \sqrt{n}}{2}\,\|v\|_2^2.$


Notes and References for Sec. 2.2 


Although a vector norm is "just" a generalization of the absolute value concept, there 
are some noteworthy subtleties: 


J.D. Pryce (1984). "A New Measure of Relative Error for Vectors," SIAM J. Num.
Anal. 21, 202-21.


2.3 Matrix Norms 


The analysis of matrix algorithms frequently requires use of matrix norms.
For example, the quality of a linear system solver may be poor if the matrix
of coefficients is "nearly singular." To quantify the notion of near-singularity
we need a measure of distance on the space of matrices. Matrix
norms provide that measure.


2.3.1 Definitions


Since $\mathbb{R}^{m \times n}$ is isomorphic to $\mathbb{R}^{mn}$, the definition of a matrix norm should be
equivalent to the definition of a vector norm. In particular, $f: \mathbb{R}^{m \times n} \to \mathbb{R}$
is a matrix norm if the following three properties hold:


    $f(A) \ge 0$                     $A \in \mathbb{R}^{m \times n}$,   ($f(A) = 0$ iff $A = 0$)
    $f(A + B) \le f(A) + f(B)$       $A, B \in \mathbb{R}^{m \times n}$
    $f(\alpha A) = |\alpha| f(A)$    $\alpha \in \mathbb{R}$, $A \in \mathbb{R}^{m \times n}$


As with vector norms, we use a double bar notation with subscripts to
designate matrix norms, i.e., $\|A\| = f(A)$.
The most frequently used matrix norms in numerical linear algebra are
the Frobenius norm,

    $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$    (2.3.1)

and the p-norms

    $\|A\|_p = \sup_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p}.$    (2.3.2)

Note that the matrix p-norms are defined in terms of the vector p-norms
that we discussed in the previous section. The verification that (2.3.1) and
(2.3.2) are matrix norms is left as an exercise. It is clear that $\|A\|_p$ is
the p-norm of the largest vector obtained by applying $A$ to a unit p-norm
vector:

    $\|A\|_p = \sup_{x \ne 0} \left\| A \left( \frac{x}{\|x\|_p} \right) \right\|_p = \max_{\|x\|_p = 1} \|Ax\|_p.$

It is important to understand that (2.3.1) and (2.3.2) define families
of norms: the 2-norm on $\mathbb{R}^{3 \times 2}$ is a different function from the 2-norm on
$\mathbb{R}^{5 \times 6}$. Thus, the easily verified inequality

    $\|AB\|_p \le \|A\|_p \|B\|_p, \qquad A \in \mathbb{R}^{m \times n},\; B \in \mathbb{R}^{n \times q}$    (2.3.3)

is really an observation about the relationship between three different norms.
Formally, we say that norms $f_1$, $f_2$, and $f_3$ on $\mathbb{R}^{m \times q}$, $\mathbb{R}^{m \times n}$, and $\mathbb{R}^{n \times q}$ are
mutually consistent if for all $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times q}$ we have
$f_1(AB) \le f_2(A) f_3(B)$.

Not all matrix norms satisfy the submultiplicative property


    $\|AB\| \le \|A\| \, \|B\|.$    (2.3.4)

For example, if $\|A\|_\Delta = \max |a_{ij}|$ and


    $A = B = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},$


then $\|AB\|_\Delta > \|A\|_\Delta \|B\|_\Delta$. For the most part we work with norms that
satisfy (2.3.4).
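
The counterexample can be checked directly. This is a minimal sketch in plain Python; max_norm and matmul are helper names of our own.

```python
def max_norm(A):
    """The 'Delta' norm of the example: largest entry in absolute value."""
    return max(abs(a) for row in A for a in row)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = B = [[1, 1], [1, 1]]
lhs = max_norm(matmul(A, B))       # every entry of AB equals 2
rhs = max_norm(A) * max_norm(B)    # 1 * 1
print(lhs, rhs, lhs > rhs)
```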

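The p-norms have the important property that for every $A \in \mathbb{R}^{m \times n}$ and
$x \in \mathbb{R}^n$ we have $\|Ax\|_p \le \|A\|_p \|x\|_p$. More generally, for any vector
norm $\|\cdot\|_\alpha$ on $\mathbb{R}^n$ and $\|\cdot\|_\beta$ on $\mathbb{R}^m$ we have $\|Ax\|_\beta \le \|A\|_{\alpha,\beta} \|x\|_\alpha$
where $\|A\|_{\alpha,\beta}$ is a matrix norm defined by


    $\|A\|_{\alpha,\beta} = \sup_{x \ne 0} \frac{\|Ax\|_\beta}{\|x\|_\alpha}.$    (2.3.5)

We say that $\|\cdot\|_{\alpha,\beta}$ is subordinate to the vector norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$.
Since the set $\{x \in \mathbb{R}^n : \|x\|_\alpha = 1\}$ is compact and $\|\cdot\|_\beta$ is continuous, it
follows that

    $\|A\|_{\alpha,\beta} = \max_{\|x\|_\alpha = 1} \|Ax\|_\beta = \|Ax^*\|_\beta$    (2.3.6)


for some $x^* \in \mathbb{R}^n$ having unit $\alpha$-norm.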


2.3.2 Some Matrix Norm Properties 


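The Frobenius and p-norms (especially $p = 1, 2, \infty$) satisfy certain inequalities
that are frequently used in the analysis of matrix computations. For
$A \in \mathbb{R}^{m \times n}$ we have


    $\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2$    (2.3.7)

    $\max_{i,j} |a_{ij}| \le \|A\|_2 \le \sqrt{mn}\,\max_{i,j} |a_{ij}|$    (2.3.8)

    $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|$    (2.3.9)

    $\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|$    (2.3.10)

    $\frac{1}{\sqrt{n}}\,\|A\|_\infty \le \|A\|_2 \le \sqrt{m}\,\|A\|_\infty$    (2.3.11)

    $\frac{1}{\sqrt{m}}\,\|A\|_1 \le \|A\|_2 \le \sqrt{n}\,\|A\|_1$    (2.3.12)

If $A \in \mathbb{R}^{m \times n}$, $1 \le i_1 \le i_2 \le m$, and $1 \le j_1 \le j_2 \le n$, then

    $\|A(i_1{:}i_2,\, j_1{:}j_2)\|_p \le \|A\|_p.$    (2.3.13)


The proofs of these relations are not hard and are left as exercises.
A sequence $\{A^{(k)}\} \in \mathbb{R}^{m \times n}$ converges if $\lim_{k \to \infty} \|A^{(k)} - A\| = 0$.
Choice of norm is irrelevant since all norms on $\mathbb{R}^{m \times n}$ are equivalent.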
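
The 1-norm, the infinity norm, and the Frobenius norm can be computed directly from their defining formulas, and (2.3.7), (2.3.11), and (2.3.12) then bracket the 2-norm. The sketch below uses plain Python lists; the function names and the 3-by-2 test matrix are our own illustrative choices.

```python
import math

def norm_1(A):      # (2.3.9): largest absolute column sum
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def norm_inf(A):    # (2.3.10): largest absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def norm_fro(A):    # (2.3.1)
    return math.sqrt(sum(a * a for row in A for a in row))

A = [[1.0, -2.0], [3.0, 4.0], [0.0, -1.0]]   # m = 3, n = 2
m, n = len(A), len(A[0])
one, inf, fro = norm_1(A), norm_inf(A), norm_fro(A)

# Brackets for the 2-norm implied by (2.3.7), (2.3.11), (2.3.12).
lower = max(fro / math.sqrt(n), inf / math.sqrt(n), one / math.sqrt(m))
upper = min(fro, math.sqrt(m) * inf, math.sqrt(n) * one)
print(one, inf, fro, lower, upper)
```

For this matrix the 2-norm is about 5.19 (the square root of the largest eigenvalue of $A^T A$), which lies inside the computed bracket.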


2.3.3 The Matrix 2-Norm


A nice feature of the matrix 1-norm and the matrix $\infty$-norm is that they
are easily computed from (2.3.9) and (2.3.10). A characterization of the
2-norm is considerably more complicated.


Theorem 2.3.1 If $A \in \mathbb{R}^{m \times n}$, then there exists a unit 2-norm n-vector $z$
such that $A^T A z = \mu^2 z$ where $\mu = \|A\|_2$.


Proof. Suppose $z \in \mathbb{R}^n$ is a unit vector such that $\|Az\|_2 = \|A\|_2$. Since
$z$ maximizes the function


    $g(x) = \frac{1}{2} \frac{\|Ax\|_2^2}{\|x\|_2^2} = \frac{1}{2} \frac{x^T A^T A x}{x^T x}$


it follows that it satisfies $\nabla g(z) = 0$ where $\nabla g$ is the gradient of $g$. But a
tedious differentiation shows that for $i = 1{:}n$


    $\frac{\partial g(z)}{\partial z_i} = \left[ (z^T z) \sum_{j=1}^{n} (A^T A)_{ij} z_j - (z^T A^T A z)\, z_i \right] / (z^T z)^2.$


In vector notation this says $A^T A z = (z^T A^T A z) z$. The theorem follows by
setting $\mu = \|Az\|_2$. □


The theorem implies that $\|A\|_2^2$ is a zero of the polynomial $p(\lambda) =
\det(A^T A - \lambda I)$. In particular, the 2-norm of $A$ is the square root of the
largest eigenvalue of $A^T A$. We have much more to say about eigenvalues in
Chapters 7 and 8. For now, we merely observe that 2-norm computation
is iterative and decidedly more complicated than the computation of the
matrix 1-norm or $\infty$-norm. Fortunately, if the object is to obtain an
order-of-magnitude estimate of $\|A\|_2$, then (2.3.7), (2.3.11), or (2.3.12) can be
used.
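
To make the "iterative" remark concrete, here is a minimal sketch that estimates $\|A\|_2$ by power iteration on $A^T A$. Power iteration is a standard technique that we substitute here for illustration; it is not the method developed later in the book, and the function name, iteration count, and test matrix are our own assumptions.

```python
import math

def two_norm_estimate(A, iters=200):
    """Estimate ||A||_2 by power iteration on A^T A (illustrative only)."""
    m, n = len(A), len(A[0])
    z = [1.0] * n
    mu = 0.0
    for _ in range(iters):
        Az = [sum(A[i][j] * z[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * Az[i] for i in range(m)) for j in range(n)]  # A^T A z
        nz = math.sqrt(sum(t * t for t in z))
        mu = math.sqrt(sum(t * t for t in Az)) / nz      # ||Az||_2 / ||z||_2
        nw = math.sqrt(sum(t * t for t in w))
        z = [t / nw for t in w]                          # next unit iterate
    return mu

A = [[1.0, -2.0], [3.0, 4.0], [0.0, -1.0]]
mu = two_norm_estimate(A)
# For this A, A^T A = [[10, 10], [10, 21]]: trace 31, determinant 110.
exact = math.sqrt((31 + math.sqrt(31 ** 2 - 4 * 110)) / 2)
print(mu, exact)
```

Convergence depends on the gap between the two largest eigenvalues of $A^T A$ and on the starting vector not being orthogonal to the dominant eigenvector; both conditions hold for this example.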

As another example of "norm analysis," here is a handy result for 2-norm
estimation.


Corollary 2.3.2 If $A \in \mathbb{R}^{m \times n}$, then $\|A\|_2 \le \sqrt{\|A\|_1 \|A\|_\infty}$.


Proof. If $z \ne 0$ is such that $A^T A z = \mu^2 z$ with $\mu = \|A\|_2$, then
$\mu^2 \|z\|_1 = \|A^T A z\|_1 \le \|A^T\|_1 \|A\|_1 \|z\|_1 = \|A\|_\infty \|A\|_1 \|z\|_1$. □


2.3.4 Perturbations and the Inverse

We frequently use norms to quantify the effect of perturbations or to prove
that a sequence of matrices converges to a specified limit. As an illustration
of these norm applications, let us quantify the change in $A^{-1}$ as a function
of change in $A$.

Lemma 2.3.3 If $F \in \mathbb{R}^{n \times n}$ and $\|F\|_p < 1$, then $I - F$ is nonsingular
and

    $(I - F)^{-1} = \sum_{k=0}^{\infty} F^k$

with

    $\|(I - F)^{-1}\|_p \le \frac{1}{1 - \|F\|_p}.$

Proof. Suppose $I - F$ is singular. It follows that $(I - F)x = 0$ for some
nonzero $x$. But then $\|x\|_p = \|Fx\|_p$ implies $\|F\|_p \ge 1$, a contradiction.
Thus, $I - F$ is nonsingular. To obtain an expression for its inverse consider
the identity

    $\left( \sum_{k=0}^{N} F^k \right)(I - F) = I - F^{N+1}.$

Since $\|F\|_p < 1$ it follows that $\lim_{k \to \infty} F^k = 0$ because $\|F^k\|_p \le \|F\|_p^k$.
Thus,

    $\left( \lim_{N \to \infty} \sum_{k=0}^{N} F^k \right)(I - F) = I.$

It follows that $(I - F)^{-1} = \lim_{N \to \infty} \sum_{k=0}^{N} F^k$. From this it is easy to show that

    $\|(I - F)^{-1}\|_p \le \sum_{k=0}^{\infty} \|F\|_p^k = \frac{1}{1 - \|F\|_p}.$ □


Note that $\|(I - F)^{-1} - I\|_p \le \|F\|_p / (1 - \|F\|_p)$ as a consequence
of the lemma. Thus, if $\epsilon \ll 1$, then $O(\epsilon)$ perturbations in $I$ induce $O(\epsilon)$
perturbations in the inverse. We next extend this result to general matrices.
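
The lemma can be illustrated by summing the series numerically. In the sketch below (plain Python, infinity norm, a 2-by-2 matrix F of our own choosing, and a fixed number of terms picked only for illustration), the partial sum S is checked against the identity S(I - F) = I and against the norm bound.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf(A):
    return max(sum(abs(a) for a in row) for row in A)

F = [[0.2, 0.1], [-0.3, 0.4]]          # ||F||_inf = 0.7 < 1
n = len(F)
I = [[float(i == j) for j in range(n)] for i in range(n)]

S = [row[:] for row in I]              # partial sum of the series
P = [row[:] for row in I]              # current power F^k
for _ in range(200):
    P = matmul(P, F)
    S = [[S[i][j] + P[i][j] for j in range(n)] for i in range(n)]

I_minus_F = [[I[i][j] - F[i][j] for j in range(n)] for i in range(n)]
residual = matmul(S, I_minus_F)        # should be close to I
bound = 1.0 / (1.0 - norm_inf(F))
print(norm_inf(S), bound)
```

In practice one would not form an inverse this way; the series is an analytical tool, and the code only confirms what the lemma asserts.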


Theorem 2.3.4 If $A$ is nonsingular and $r \equiv \|A^{-1}E\|_p < 1$, then $A + E$
is nonsingular and $\|(A + E)^{-1} - A^{-1}\|_p \le \|E\|_p \|A^{-1}\|_p^2 / (1 - r)$.

Proof. Since $A$ is nonsingular $A + E = A(I - F)$ where $F = -A^{-1}E$.
Since $\|F\|_p = r < 1$ it follows from Lemma 2.3.3 that $I - F$ is nonsingular
and $\|(I - F)^{-1}\|_p \le 1/(1 - r)$. Now $(A + E)^{-1} = (I - F)^{-1}A^{-1}$ and so

    $\|(A + E)^{-1}\|_p \le \frac{\|A^{-1}\|_p}{1 - r}.$

Equation (2.1.3) says that $(A + E)^{-1} - A^{-1} = -A^{-1}E(A + E)^{-1}$ and so
by taking norms we find

    $\|(A + E)^{-1} - A^{-1}\|_p \le \|A^{-1}\|_p \|E\|_p \|(A + E)^{-1}\|_p \le \frac{\|A^{-1}\|_p^2 \|E\|_p}{1 - r}.$ □
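
A quick numerical check of the theorem for a 2-by-2 example follows. The matrices A and E, the use of the infinity norm, and the adjugate-based helper inv2 are our own choices for illustration.

```python
def inv2(A):
    """Inverse of a 2-by-2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def norm_inf(A):
    return max(sum(abs(a) for a in row) for row in A)

A = [[4.0, 1.0], [2.0, 3.0]]
E = [[0.01, -0.02], [0.03, 0.01]]
Ainv = inv2(A)
r = norm_inf(matmul(Ainv, E))                       # r = ||A^{-1} E||
AE_inv = inv2([[A[i][j] + E[i][j] for j in range(2)] for i in range(2)])
change = norm_inf([[AE_inv[i][j] - Ainv[i][j] for j in range(2)]
                   for i in range(2)])
bound = norm_inf(E) * norm_inf(Ainv) ** 2 / (1.0 - r)
print(r, change, bound)
```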


Problems 


P2.3.1 Show $\|AB\|_p \le \|A\|_p \|B\|_p$ where $1 \le p \le \infty$.

P2.3.2 Let $B$ be any submatrix of $A$. Show that $\|B\|_p \le \|A\|_p$.

P2.3.3 Show that if $D = \mathrm{diag}(\mu_1, \ldots, \mu_k) \in \mathbb{R}^{m \times n}$ with $k = \min\{m, n\}$, then
$\|D\|_p = \max |\mu_i|$.

P2.3.4 Verify (2.3.7) and (2.3.8).

P2.3.5 Verify (2.3.9) and (2.3.10).

P2.3.6 Verify (2.3.11) and (2.3.12).

P2.3.7 Verify (2.3.13).

P2.3.8 Show that if $0 \ne s \in \mathbb{R}^n$ and $E \in \mathbb{R}^{m \times n}$, then

    $\left\| E \left( I - \frac{s s^T}{s^T s} \right) \right\|_F^2 = \|E\|_F^2 - \frac{\|Es\|_2^2}{s^T s}.$

P2.3.9 Suppose $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$. Show that if $E = uv^T$ then $\|E\|_F = \|E\|_2 =
\|u\|_2 \|v\|_2$ and that $\|E\|_\infty \le \|u\|_\infty \|v\|_1$.

P2.3.10 Suppose $A \in \mathbb{R}^{m \times n}$, $y \in \mathbb{R}^m$, and $0 \ne s \in \mathbb{R}^n$. Show that $E = (y - As)s^T / s^T s$
has the smallest 2-norm of all m-by-n matrices $E$ that satisfy $(A + E)s = y$.

Notes and References for Sec. 2.3

For deeper issues concerning matrix/vector norms, see


F.L. Bauer and C.T. Fike (1960). "Norms and Exclusion Theorems," Numer. Math. 2,
137-44.

L. Mirsky (1960). "Symmetric Gauge Functions and Unitarily Invariant Norms," Quart.
J. Math. 11, 50-59.

A.S. Householder (1964). The Theory of Matrices in Numerical Analysis, Dover
Publications, New York.

N.J. Higham (1992). "Estimating the Matrix p-Norm," Numer. Math. 62, 539-556.


2.4 Finite Precision Matrix Computations 


In part, rounding errors are what make the matrix computation area so
nontrivial and interesting. In this section we set up a model of floating point
arithmetic and then use it to develop error bounds for floating point dot
products, saxpy's, matrix-vector products and matrix-matrix products. For
a more comprehensive treatment than what we offer, see Higham (1996) or
Wilkinson (1965). The coverage in Forsythe and Moler (1967) and Stewart
(1973) is also excellent.


2.4.1 The Floating Point Numbers 


When calculations are performed on a computer, each arithmetic operation
is generally affected by roundoff error. This error arises because the
machine hardware can only represent a subset of the real numbers. We
denote this subset by $F$ and refer to its elements as floating point numbers.
Following conventions set forth in Forsythe, Malcolm, and Moler (1977, pp.
10-29), the floating point number system on a particular computer is
characterized by four integers: the base $\beta$, the precision $t$, and the exponent
range $[L, U]$. In particular, $F$ consists of all numbers $f$ of the form


    $f = \pm .d_1 d_2 \ldots d_t \times \beta^e, \qquad 0 \le d_i < \beta, \quad d_1 \ne 0, \quad L \le e \le U$


together with zero. Notice that for a nonzero $f \in F$ we have $m \le |f| \le M$
where


    $m = \beta^{L-1}$ and $M = \beta^U (1 - \beta^{-t}).$    (2.4.1)


As an example, if $\beta = 2$, $t = 3$, $L = 0$, and $U = 2$, then the non-negative
elements of $F$ are represented by hash marks on the axis displayed in FIG.
2.4.1. Notice that the floating point numbers are not equally spaced. A
typical value for $(\beta, t, L, U)$ might be (2, 56, -64, 64).


FIGURE 2.4.1 Sample Floating Point Number System
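
The toy system with $\beta = 2$, $t = 3$, $L = 0$, $U = 2$ is small enough to enumerate. The sketch below (exact rational arithmetic via the standard fractions module; the function name float_set is ours) lists the non-negative elements of F and confirms the values of m and M given by (2.4.1).

```python
from fractions import Fraction
from itertools import product

def float_set(beta, t, L, U):
    """Non-negative elements of F for base beta, precision t, exponents L..U."""
    F = {Fraction(0)}
    for e in range(L, U + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:          # normalization: d_1 != 0
                continue
            mant = sum(Fraction(d, beta ** (i + 1)) for i, d in enumerate(digits))
            F.add(mant * Fraction(beta) ** e)
    return sorted(F)

F = float_set(2, 3, 0, 2)
m = Fraction(2) ** (0 - 1)                       # beta^(L-1)
M = Fraction(2) ** 2 * (1 - Fraction(2) ** -3)   # beta^U (1 - beta^-t)
print([float(f) for f in F])
```

The printed list shows the uneven spacing: the gap between neighbors doubles each time the exponent increases by one.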


2.4.2 A Model of Floating Point Arithmetic 
To make general pronouncements about the effect of rounding errors on a
given algorithm, it is necessary to have a model of computer arithmetic on
$F$. To this end define the set $G$ by


    $G = \{ x \in \mathbb{R} : m \le |x| \le M \} \cup \{0\}$    (2.4.2)

and the operator $\mathrm{fl}: G \to F$ by


    $\mathrm{fl}(x)$ = nearest $c \in F$ to $x$, with ties handled
              by rounding away from zero.


The $\mathrm{fl}$ operator can be shown to satisfy

    $\mathrm{fl}(x) = x(1 + \epsilon), \qquad |\epsilon| \le u$    (2.4.3)

where $u$ is the unit roundoff defined by


    $u = \frac{1}{2} \beta^{1-t}.$    (2.4.4)


Let $a$ and $b$ be any two floating point numbers and let "op" denote any
of the four arithmetic operations $+, -, \times, \div$. If $a$ op $b \in G$, then in our
model of floating point arithmetic we assume that the computed version of
($a$ op $b$) is given by $\mathrm{fl}(a$ op $b)$. It follows that $\mathrm{fl}(a$ op $b) = (a$ op $b)(1 + \epsilon)$
with $|\epsilon| \le u$. Thus,


    $\frac{|\mathrm{fl}(a \text{ op } b) - (a \text{ op } b)|}{|a \text{ op } b|} \le u, \qquad a \text{ op } b \ne 0$    (2.4.5)


showing that there is small relative error associated with individual
arithmetic operations.¹ It is important to realize, however, that this is not
necessarily the case when a sequence of operations is involved.
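
The model (2.4.5) can be checked on a machine with IEEE double precision arithmetic, which we assume here corresponds to $\beta = 2$, $t = 53$, so that $u = 2^{-53}$. In this sketch, Fraction converts each stored double to its exact rational value, so the relative error of every operation is computed without further rounding; the operands are arbitrary choices of ours.

```python
from fractions import Fraction

u = Fraction(1, 2 ** 53)      # (2.4.4) with beta = 2, t = 53 (IEEE double assumed)

def rel_err(computed, exact):
    return abs(Fraction(computed) - exact) / abs(exact)

a, b = 0.1, 0.3               # Fraction(a) is the exact value of the stored double
checks = {
    "+": rel_err(a + b, Fraction(a) + Fraction(b)),
    "-": rel_err(a - b, Fraction(a) - Fraction(b)),
    "*": rel_err(a * b, Fraction(a) * Fraction(b)),
    "/": rel_err(a / b, Fraction(a) / Fraction(b)),
}
for op, err in checks.items():
    print(op, float(err), err <= u)
```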


Example 2.4.1 If $\beta = 10$, $t = 3$ floating point arithmetic is used, then it can be shown
that $\mathrm{fl}[\mathrm{fl}(10^{-4} + 1) - 1] = 0$, implying a relative error of 1. On the other hand the
exact answer is given by $\mathrm{fl}[10^{-4} + \mathrm{fl}(1 - 1)] = 10^{-4}$. Floating point arithmetic is
not always associative.
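
Example 2.4.1 can be reproduced with Python's decimal module by setting the working precision to three significant digits. This is a sketch under the assumption that the module's default rounding mode is an acceptable stand-in for the fl operator; no ties arise in this computation, so the tie-breaking rule does not matter.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3                      # beta = 10, t = 3
    tiny, one = Decimal("1e-4"), Decimal(1)
    left = (tiny + one) - one         # fl(fl(10^-4 + 1) - 1)
    right = tiny + (one - one)        # fl(10^-4 + fl(1 - 1))
print(left, right)
```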


If $a$ op $b \notin G$, then an arithmetic exception occurs. Overflow and
underflow result whenever $|a$ op $b| > M$ or $0 < |a$ op $b| < m$, respectively.
The handling of these and other exceptions is hardware/system dependent.


2.4.3 Cancellation 


Another important aspect of finite precision arithmetic is the phenomenon
of catastrophic cancellation. Roughly speaking, this term refers to the
extreme loss of correct significant digits when small numbers are additively
computed from large numbers. A well-known example taken from Forsythe,
Malcolm and Moler (1977, pp. 14-16) is the computation of $e^{-a}$ via Taylor
series with $a > 0$. The roundoff error associated with this method is
approximately $u$ times the largest partial sum. For large $a$, this error can
actually be larger than the exact exponential and there will be no correct
digits in the answer no matter how many terms in the series are summed.
On the other hand, if enough terms in the Taylor series for $e^{a}$ are added and
the result reciprocated, then an estimate of $e^{-a}$ to full precision is attained.

¹There are important examples of machines whose additive floating point operations
satisfy $\mathrm{fl}(a \pm b) = (1 + \epsilon_1)a \pm (1 + \epsilon_2)b$ where $|\epsilon_1|, |\epsilon_2| \le u$. In such an environment,
the inequality $|\mathrm{fl}(a \pm b) - (a \pm b)| \le u|a \pm b|$ need not hold.
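
The $e^{-a}$ computation described in this section can be tried in IEEE double precision. The following is a minimal sketch; the value a = 30, the fixed term count, and the function name exp_taylor are our own choices, and math.exp serves as the reference value.

```python
import math

def exp_taylor(x, terms=200):
    """Sum the Taylor series of e^x term by term in floating point."""
    s, term = 1.0, 1.0
    for k in range(1, terms):
        term *= x / k
        s += term
    return s

a = 30.0
exact = math.exp(-a)
naive = exp_taylor(-a)            # alternating series: heavy cancellation
safe = 1.0 / exp_taylor(a)        # positive terms, then reciprocate
rel_naive = abs(naive - exact) / exact
rel_safe = abs(safe - exact) / exact
print(rel_naive, rel_safe)
```

The partial sums of the alternating series reach magnitudes near $10^{12}$, so rounding errors of roughly $u \cdot 10^{12}$ swamp a true answer of about $10^{-13}$, whereas the reciprocated sum has no cancellation at all.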


2.4.4 The Absolute Value Notation 


Before we proceed with the roundoff analysis of some basic matrix
calculations, we acquire some useful notation. Suppose $A \in \mathbb{R}^{m \times n}$ and that we
wish to quantify the errors associated with its floating point representation.
Denoting the stored version of $A$ by $\mathrm{fl}(A)$, we see that


    $[\mathrm{fl}(A)]_{ij} = \mathrm{fl}(a_{ij}) = a_{ij}(1 + \epsilon_{ij}), \qquad |\epsilon_{ij}| \le u$    (2.4.6)


for all $i$ and $j$. A better way to say the same thing results if we adopt two
conventions. If $A$ and $B$ are in $\mathbb{R}^{m \times n}$, then


    $B = |A| \;\Rightarrow\; b_{ij} = |a_{ij}|, \quad i = 1{:}m, \; j = 1{:}n$
    $B \le A \;\Rightarrow\; b_{ij} \le a_{ij}, \quad i = 1{:}m, \; j = 1{:}n.$

With this notation we see that (2.4.6) has the form

    $|\mathrm{fl}(A) - A| \le u|A|.$

A relation such as this can be easily turned into a norm inequality, e.g.,
$\|\mathrm{fl}(A) - A\|_1 \le u\|A\|_1$. However, when quantifying the rounding errors
in a matrix manipulation, the absolute value notation can be a lot more
informative because it provides a comment on each $(i, j)$ entry.


2.4.5 Roundoff in Dot Products 


We begin our study of finite precision matrix computations by considering
the rounding errors that result in the standard dot product algorithm:


    s = 0
    for k = 1:n
        s = s + x_k y_k                    (2.4.7)
    end


Here, $x$ and $y$ are n-by-1 floating point vectors.

In trying to quantify the rounding errors in this algorithm, we are
immediately confronted with a notational problem: the distinction
between computed and exact quantities. When the underlying computations
are clear, we shall use the $\mathrm{fl}(\cdot)$ operator to signify computed quantities.
Thus, $\mathrm{fl}(x^T y)$ denotes the computed output of (2.4.7). Let us bound
$|\mathrm{fl}(x^T y) - x^T y|$. If

    $s_p = \mathrm{fl}\left( \sum_{k=1}^{p} x_k y_k \right),$

then $s_1 = x_1 y_1 (1 + \delta_1)$ with $|\delta_1| \le u$ and for $p = 2{:}n$


    $s_p = \mathrm{fl}(s_{p-1} + \mathrm{fl}(x_p y_p))$
        $= (s_{p-1} + x_p y_p (1 + \delta_p))(1 + \epsilon_p), \qquad |\delta_p|, |\epsilon_p| \le u.$    (2.4.8)


A little algebra shows that


    $\mathrm{fl}(x^T y) = s_n = \sum_{k=1}^{n} x_k y_k (1 + \gamma_k)$

where

    $(1 + \gamma_k) = (1 + \delta_k) \prod_{j=k}^{n} (1 + \epsilon_j)$

with the convention that $\epsilon_1 = 0$. Thus,

    $|\mathrm{fl}(x^T y) - x^T y| \le \sum_{k=1}^{n} |x_k y_k| \, |\gamma_k|.$    (2.4.9)


To proceed further, we must bound the quantities $|\gamma_k|$ in terms of $u$. The
following result is useful for this purpose.


Lemma 2.4.1 If $(1 + \alpha) = \prod_{k=1}^{n} (1 + \alpha_k)$ where $|\alpha_k| \le u$ and $nu \le .01$, then
$|\alpha| \le 1.01nu$.


Proof. See Higham (1996, p. 75). □

Applying this result to (2.4.9) under the "reasonable" assumption $nu \le .01$
gives

    $|\mathrm{fl}(x^T y) - x^T y| \le 1.01nu\,|x|^T |y|.$    (2.4.10)

Notice that if $|x^T y| \ll |x|^T |y|$, then the relative error in $\mathrm{fl}(x^T y)$ may not
be small.
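
The bound (2.4.10) can be tested on a machine with IEEE double precision arithmetic (assumed here, so that $u = 2^{-53}$). In this sketch the exact dot product is obtained with rational arithmetic from the standard fractions module; the vector length, random seed, and helper name dot are illustrative choices of ours.

```python
from fractions import Fraction
import random

def dot(x, y):
    """Algorithm (2.4.7) in IEEE double arithmetic."""
    s = 0.0
    for xk, yk in zip(x, y):
        s = s + xk * yk
    return s

random.seed(0)
n = 1000
x = [random.uniform(-1, 1) for _ in range(n)]
y = [random.uniform(-1, 1) for _ in range(n)]

u = Fraction(1, 2 ** 53)                      # unit roundoff, assuming IEEE double
exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))
abs_sum = sum(abs(Fraction(a) * Fraction(b)) for a, b in zip(x, y))   # |x|^T |y|
error = abs(Fraction(dot(x, y)) - exact)
bound = Fraction(101, 100) * n * u * abs_sum  # right-hand side of (2.4.10)
print(float(error), float(bound), error <= bound)
```

The observed error is typically far below the bound, which is consistent with the remarks on a priori bounds in the next subsection.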


2.4.6 Alternative Ways to Quantify Roundoff Error


An easier but less rigorous way of bounding $\alpha$ in Lemma 2.4.1 is to say
$|\alpha| \le nu + O(u^2)$. With this convention we have


    $|\mathrm{fl}(x^T y) - x^T y| \le nu\,|x|^T |y| + O(u^2).$    (2.4.11)

Other ways of expressing the same result include


    $|\mathrm{fl}(x^T y) - x^T y| \le \phi(n)\,u\,|x|^T |y|$    (2.4.12)


and

    $|\mathrm{fl}(x^T y) - x^T y| \le c\,n\,u\,|x|^T |y|,$    (2.4.13)


where in (2.4.12) $\phi(n)$ is a "modest" function of $n$ and in (2.4.13) $c$ is a
constant of order unity.

We shall not express a preference for any of the error bounding styles
shown in (2.4.10)-(2.4.13). This spares us the necessity of translating the
roundoff results that appear in the literature into a fixed format. Moreover,
paying overly close attention to the details of an error bound is inconsistent
with the "philosophy" of roundoff analysis. As Wilkinson (1971, p. 567)
says,


    There is still a tendency to attach too much importance to the
    precise error bounds obtained by an a priori error analysis. In
    my opinion, the bound itself is usually the least important part
    of it. The main object of such an analysis is to expose the
    potential instabilities, if any, of an algorithm so that hopefully
    from the insight thus obtained one might be led to improved
    algorithms. Usually the bound itself is weaker than it might have
    been because of the necessity of restricting the mass of detail
    to a reasonable level and because of the limitations imposed by
    expressing the errors in terms of matrix norms. A priori bounds
    are not, in general, quantities that should be used in practice.
    Practical error bounds should usually be determined by some
    form of a posteriori error analysis, since this takes full
    advantage of the statistical distribution of rounding errors and of any
    special features, such as sparseness, in the matrix.


It is important to keep these perspectives in mind. 


2.4.7 Dot Product Accumulation 


Some computers have provision for accumulating dot products in double
precision. This means that if $x$ and $y$ are floating point vectors with length
$t$ mantissas, then the running sum $s$ in (2.4.7) is built up in a register with
a $2t$ digit mantissa. Since the multiplication of two t-digit floating point
numbers can be stored exactly in a double precision variable, it is only
when $s$ is written to single precision memory that any roundoff occurs. In
this situation one can usually assert that a computed dot product has good
relative error, i.e., $\mathrm{fl}(x^T y) = x^T y(1 + \delta)$ where $|\delta| \approx u$. Thus, the ability
to accumulate dot products is very appealing.


2.4.8 Roundoff in Other Basic Matrix Computations


It is easy to show that if $A$ and $B$ are floating point matrices and $\alpha$ is a
floating point number, then


    $\mathrm{fl}(\alpha A) = \alpha A + E, \qquad |E| \le u|\alpha A|$    (2.4.14)

and

    $\mathrm{fl}(A + B) = (A + B) + E, \qquad |E| \le u|A + B|.$    (2.4.15)


As a consequence of these two results, it is easy to verify that computed
saxpy's and outer product updates satisfy


    $\mathrm{fl}(\alpha x + y) = \alpha x + y + z, \qquad |z| \le u(2|\alpha x| + |y|) + O(u^2)$    (2.4.16)


    $\mathrm{fl}(C + uv^T) = C + uv^T + E, \qquad |E| \le \mathbf{u}(|C| + 2|uv^T|) + O(\mathbf{u}^2),$    (2.4.17)


where in (2.4.17) the bold $\mathbf{u}$ is the unit roundoff and $u$, $v$ are vectors.
Using (2.4.10) it is easy to show that a dot product based multiplication of
two floating point matrices $A$ and $B$ satisfies


    $\mathrm{fl}(AB) = AB + E, \qquad |E| \le nu|A||B| + O(u^2).$    (2.4.18)


The same result applies if a gaxpy or outer product based procedure is used.
Notice that matrix multiplication does not necessarily give small relative
error since $|AB|$ may be much smaller than $|A||B|$, e.g.,


    $\begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -.99 & 0 \end{bmatrix} = \begin{bmatrix} .01 & 0 \\ 0 & 0 \end{bmatrix}.$

It is easy to obtain norm bounds from the roundoff results developed thus
far. If we look at the 1-norm error in floating point matrix multiplication,
then it is easy to show from (2.4.18) that

    $\|\mathrm{fl}(AB) - AB\|_1 \le nu\|A\|_1 \|B\|_1 + O(u^2).$    (2.4.19)


2.4.9 Forward and Backward Error Analyses 


Each roundoff bound given above is the consequence of a forward error
analysis. An alternative style of characterizing the roundoff errors in an
algorithm is accomplished through a technique known as backward error
analysis. Here, the rounding errors are related to the data of the problem
rather than to its solution. By way of illustration, consider the $n = 2$
version of triangular matrix multiplication. It can be shown that:


    $\mathrm{fl}(AB) = \begin{bmatrix} a_{11}b_{11}(1 + \epsilon_1) & (a_{11}b_{12}(1 + \epsilon_2) + a_{12}b_{22}(1 + \epsilon_3))(1 + \epsilon_4) \\ 0 & a_{22}b_{22}(1 + \epsilon_5) \end{bmatrix}$

where $|\epsilon_i| \le u$, for $i = 1{:}5$. However, if we define

    $\hat{A} = \begin{bmatrix} a_{11} & a_{12}(1 + \epsilon_3)(1 + \epsilon_4) \\ 0 & a_{22}(1 + \epsilon_5) \end{bmatrix}$

and

    $\hat{B} = \begin{bmatrix} b_{11}(1 + \epsilon_1) & b_{12}(1 + \epsilon_2)(1 + \epsilon_4) \\ 0 & b_{22} \end{bmatrix}$

then it is easily verified that $\mathrm{fl}(AB) = \hat{A}\hat{B}$. Moreover,

    $\hat{A} = A + E, \qquad |E| \le 2u|A| + O(u^2)$
    $\hat{B} = B + F, \qquad |F| \le 2u|B| + O(u^2).$


In other words, the computed product is the exact product of slightly
perturbed $A$ and $B$.
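
The backward error construction can be carried out explicitly. In the sketch below (IEEE double arithmetic assumed; the entries of A and B are arbitrary values of ours), the individual rounding errors $\epsilon_1, \ldots, \epsilon_5$ are recovered exactly with rational arithmetic, the perturbed factors are formed as in the text, and their exact product is compared with the computed product.

```python
from fractions import Fraction as Fr

a11, a12, a22 = 0.1, 0.7, 0.3
b11, b12, b22 = 0.9, 0.2, 0.6

# Computed entries of fl(AB) in IEEE double arithmetic.
p1, p2, p3 = a11 * b11, a11 * b12, a12 * b22
c11, c12, c22 = p1, p2 + p3, a22 * b22

def eps(computed, exact):
    """Relative rounding error e with computed = exact * (1 + e)."""
    return Fr(computed) / exact - 1

e1 = eps(p1, Fr(a11) * Fr(b11))
e2 = eps(p2, Fr(a11) * Fr(b12))
e3 = eps(p3, Fr(a12) * Fr(b22))
e4 = eps(c12, Fr(p2) + Fr(p3))
e5 = eps(c22, Fr(a22) * Fr(b22))

# Perturbed factors defined in the text.
A_hat = [[Fr(a11), Fr(a12) * (1 + e3) * (1 + e4)], [Fr(0), Fr(a22) * (1 + e5)]]
B_hat = [[Fr(b11) * (1 + e1), Fr(b12) * (1 + e2) * (1 + e4)], [Fr(0), Fr(b22)]]

prod12 = A_hat[0][0] * B_hat[0][1] + A_hat[0][1] * B_hat[1][1]
exact_match = (A_hat[0][0] * B_hat[0][0] == Fr(c11)
               and prod12 == Fr(c12)
               and A_hat[1][1] * B_hat[1][1] == Fr(c22))
print(exact_match)
```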


2.4.10 Error in Strassen Multiplication 


In §1.3.8 we outlined an unconventional matrix multiplication procedure
due to Strassen (1969). It is instructive to compare the effect of roundoff
in this method with the effect of roundoff in any of the conventional matrix
multiplication methods of §1.1.

It can be shown that the Strassen approach (Algorithm 1.3.1) produces
a $\hat{C} = \mathrm{fl}(AB)$ that satisfies an inequality of the form (2.4.19). This is
perfectly satisfactory in many applications. However, the $\hat{C}$ that Strassen's
method produces does not always satisfy an inequality of the form (2.4.18).
To see this, suppose

    $A = B = \begin{bmatrix} .99 & .0010 \\ .0010 & .99 \end{bmatrix}$

and that we execute Algorithm 1.3.1 using 2-digit floating point arithmetic.
Among other things, the following quantities are computed:


    $\hat{P}_3 = \mathrm{fl}(.99(.001 - .99)) = -.98$
    $\hat{P}_5 = \mathrm{fl}((.99 + .001).99) = .98$
    $\hat{c}_{12} = \mathrm{fl}(\hat{P}_3 + \hat{P}_5) = 0.0$

Now in exact arithmetic $c_{12} = 2(.001)(.99) = .00198$ and thus Algorithm 1.3.1
produces a $\hat{c}_{12}$ with no correct significant digits. The Strassen approach gets
into trouble in this example because small off-diagonal entries are combined
with large diagonal entries. Note that in conventional matrix multiplication
neither $b_{12}$ and $b_{22}$ nor $a_{11}$ and $a_{12}$ are summed. Thus the contribution of
the small off-diagonal elements is not lost. Indeed, for the above $A$ and $B$
a conventional matrix multiply gives $\hat{c}_{12} = .0020$.
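
The 2-digit computation can be replayed with Python's decimal module. This is a sketch under the assumption that a decimal context with precision 2 models the arithmetic of the example (no ties occur, so the rounding mode is immaterial); only the two Strassen products that enter $c_{12}$ are formed, not the full Algorithm 1.3.1.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 2                                   # 2-digit decimal arithmetic
    a11 = a22 = b11 = b22 = Decimal("0.99")
    a12 = a21 = b12 = b21 = Decimal("0.0010")

    p3 = a11 * (b12 - b22)        # Strassen product P3
    p5 = (a11 + a12) * b22        # Strassen product P5
    c12_strassen = p3 + p5

    c12_conventional = a11 * b12 + a12 * b22
print(p3, p5, c12_strassen, c12_conventional)
```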

Failure to produce a componentwise accurate $\hat{C}$ can be a serious
shortcoming in some applications. For example, in Markov processes the $a_{ij}$,
$b_{ij}$, and $c_{ij}$ are transition probabilities and are therefore nonnegative. It
may be critical to compute $c_{ij}$ accurately if it reflects a particularly
important probability in the modeled phenomena. Note that if $A \ge 0$ and
$B \ge 0$, then conventional matrix multiplication produces a product $\hat{C}$ that
has small componentwise relative error:


    $|\hat{C} - C| \le nu|A||B| + O(u^2) = nu|C| + O(u^2).$


This follows from (2.4.18). Because we cannot say the same for the Strassen
approach, we conclude that Algorithm 1.3.1 is not attractive for certain
nonnegative matrix multiplication problems if relatively accurate $\hat{c}_{ij}$ are
required.

Extrapolating from this discussion we reach two fairly obvious but
important conclusions:


• Different methods for computing the same quantity can produce
substantially different results.


• Whether or not an algorithm produces satisfactory results depends
upon the type of problem solved and the goals of the user.


These observations are clarified in subsequent chapters and are intimately
related to the concepts of algorithm stability and problem condition.


Problems 


P2.4.1 Show that if (2.4.7) is applied with $y = x$, then $\mathrm{fl}(x^T x) = x^T x(1 + \alpha)$ where
$|\alpha| \le nu + O(u^2)$.

P2.4.2 Prove (2.4.3).

P2.4.3 Show that if $E \in \mathbb{R}^{m \times n}$ with $m \ge n$, then $\|\,|E|\,\|_2 \le \sqrt{n}\,\|E\|_2$. This result is
useful when deriving norm bounds from absolute value bounds.

P2.4.4 Assume the existence of a square root function satisfying $\mathrm{fl}(\sqrt{x}) = \sqrt{x}(1 + \epsilon)$
with $|\epsilon| \le u$. Give an algorithm for computing $\|x\|_2$ and bound the rounding errors.

P2.4.5 Suppose $A$ and $B$ are n-by-n upper triangular floating point matrices. If $\hat{C} =
\mathrm{fl}(AB)$ is computed using one of the conventional §1.1 algorithms, does it follow that
$\hat{C} = \hat{A}\hat{B}$ where $\hat{A}$ and $\hat{B}$ are close to $A$ and $B$?

P2.4.6 Suppose $A$ and $B$ are n-by-n floating point matrices and that $A$ is nonsingular
with $\|\,|A^{-1}|\,|A|\,\|_\infty = \tau$. Show that if $\hat{C} = \mathrm{fl}(AB)$ is obtained using any of the
algorithms in §1.1, then there exists a $\hat{B}$ so $\hat{C} = A\hat{B}$ and $\|\hat{B} - B\|_\infty \le nu\tau\|B\|_\infty +
O(u^2)$.

P2.4.7 Prove (2.4.18).




Notes and References for Sec. 2.4 
For a general introduction to the effects of roundoff error, we recommend


J.H. Wilkinson (1963). Rounding Errors in Algebraic Processes, Prentice-Hall,
Englewood Cliffs, NJ.

J.H. Wilkinson (1971). "Modern Error Analysis," SIAM Review 13, 548-68.

D. Kahaner, C.B. Moler, and S. Nash (1988). Numerical Methods and Software,
Prentice-Hall, Englewood Cliffs, NJ.

F. Chaitin-Chatelin and V. Frayssé (1996). Lectures on Finite Precision Computations,
SIAM Publications, Philadelphia.


More recent developments in error analysis involve interval analysis, the building of
statistical models of roundoff error, and the automating of the analysis itself:


T.E. Hull and J.R. Swensen (1966). "Tests of Probabilistic Models for Propagation of
Roundoff Errors," Comm. ACM 9, 108-13.

J. Larson and A. Sameh (1978). "Efficient Calculation of the Effects of Roundoff Errors,"
ACM Trans. Math. Soft. 4, 228-36.

W. Miller and D. Spooner (1978). "Software for Roundoff Analysis, II," ACM Trans.
Math. Soft. 4, 369-90.

J.M. Yohe (1979). "Software for Interval Arithmetic: A Reasonably Portable Package,"
ACM Trans. Math. Soft. 5, 50-63.


Anyone engaged in serious software development needs a thorough understanding of 
floating point arithmetic. A good way to begin acquiring knowledge in this direction is 
to read about the IEEE floating point standard in 


D. Goldberg (1991). “What Every Computer Scientist Should Know About Floating 
Point Arithmetic,” ACM Surveys 23, 5-48. 


See also 


R.P. Brent (1978). "A Fortran Multiple Precision Arithmetic Package," ACM Trans.
Math. Soft. 4, 57-70.

R.P. Brent (1978). "Algorithm 524 MP, a Fortran Multiple Precision Arithmetic
Package," ACM Trans. Math. Soft. 4, 71-81.

J.W. Demmel (1984). "Underflow and the Reliability of Numerical Software," SIAM J.
Sci. and Stat. Comp. 5, 887-919.

U.W. Kulisch and W.L. Miranker (1986). "The Arithmetic of the Digital Computer,"
SIAM Review 28, 1-40.

W.J. Cody (1988). "ALGORITHM 665 MACHAR: A Subroutine to Dynamically
Determine Machine Parameters," ACM Trans. Math. Soft. 14, 303-311.

D.H. Bailey, H.D. Simon, J.T. Barton, M.J. Fouts (1989). "Floating Point Arithmetic
in Future Supercomputers," Int'l J. Supercomputing Appl. 3, 86-90.

D.H. Bailey (1993). "Algorithm 719: Multiprecision Translation and Execution of
FORTRAN Programs," ACM Trans. Math. Soft. 19, 288-319.


The subtleties associated with the development of high-quality software, even for
"simple" problems, are immense. A good example is the design of a subroutine to compute
2-norms:


J.M. Blue (1978). "A Portable FORTRAN Program to Find the Euclidean Norm of a
Vector," ACM Trans. Math. Soft. 4, 15-23.


For an analysis of the Strassen algorithm and other "fast" linear algebra procedures see

R.P. Brent (1970). "Error Analysis of Algorithms for Matrix Multiplication and
Triangular Decomposition Using Winograd's Identity," Numer. Math. 16, 145-156.

W. Miller (1975). "Computational Complexity and Numerical Stability," SIAM J.
Computing 4, 97-107.

N.J. Higham (1992). "Stability of a Method for Multiplying Complex Matrices with
Three Real Matrix Multiplications," SIAM J. Matrix Anal. Appl. 13, 681-687.

J.W. Demmel and N.J. Higham (1992). "Stability of Block Algorithms with Fast Level-3
BLAS," ACM Trans. Math. Soft. 18, 274-291.


2.5 Orthogonality and the SVD 


Orthogonality has a very prominent role to play in matrix computations. 
After establishing a few definitions we prove the extremely useful singular 
value decomposition (SVD). Among other things, the SVD enables us to 
intelligently handle the matrix rank problem. The concept of rank, though 
perfectly clear in the exact arithmetic context, is tricky in the presence of 
roundoff error and fuzzy data. With the SVD we can introduce the practical 
notion of numerical rank. 


2.5.1 Orthogonality 


A set of vectors $\{x_1, \ldots, x_p\}$ in $\mathbb{R}^m$ is orthogonal if $x_i^T x_j = 0$ whenever
$i \ne j$ and orthonormal if $x_i^T x_j = \delta_{ij}$. Intuitively, orthogonal vectors are
maximally independent for they point in totally different directions.

A collection of subspaces $S_1, \ldots, S_p$ in $\mathbb{R}^m$ is mutually orthogonal if
$x^T y = 0$ whenever $x \in S_i$ and $y \in S_j$ for $i \ne j$. The orthogonal complement
of a subspace $S \subseteq \mathbb{R}^m$ is defined by


    $S^\perp = \{ y \in \mathbb{R}^m : y^T x = 0 \text{ for all } x \in S \}$


and it is not hard to show that $\mathrm{ran}(A)^\perp = \mathrm{null}(A^T)$. The vectors $v_1, \ldots, v_k$
form an orthonormal basis for a subspace $S \subseteq \mathbb{R}^m$ if they are orthonormal
and span $S$.

A matrix Q c R™*™ is said to be orthogonal if QTQ — I. IEQ — 
[dis -+ -s gm ] is orthogonal, then the q; form an orthonormal basis for R™. 
It ig always possible to extend such a basis to a full orthonormal basis 
[vi,..., Um} for R™: 


Theorem 2.5.1 If $V_1 \in \mathbb{R}^{n\times r}$ has orthonormal columns, then there exists
$V_2 \in \mathbb{R}^{n\times(n-r)}$ such that

$$ V = [\,V_1 \;\; V_2\,] $$

is orthogonal. Note that $\mathrm{ran}(V_1)^\perp = \mathrm{ran}(V_2)$.


Proof. This is a standard result from introductory linear algebra. It is
also a corollary of the QR factorization that we present in §5.2. □


2.5.2 Norms and Orthogonal Transformations


The 2-norm is invariant under orthogonal transformation, for if $Q^TQ = I$,
then $\|Qx\|_2^2 = x^TQ^TQx = x^Tx = \|x\|_2^2$. The matrix 2-norm and
the Frobenius norm are also invariant with respect to orthogonal
transformations. In particular, it is easy to show that for all orthogonal Q and Z
of appropriate dimensions we have

$$ \|QAZ\|_F = \|A\|_F \tag{2.5.1} $$

and

$$ \|QAZ\|_2 = \|A\|_2. \tag{2.5.2} $$
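
The invariance in (2.5.1) and (2.5.2) can be checked numerically on a small case. The following is a minimal sketch in plain Python, assuming 2-by-2 matrices only; the helper names (`matmul`, `rotation`, `fro_norm`, `two_norm_2x2`) and the rotation angles are illustrative choices, not notation from the text. The 2-norm is obtained from the closed-form eigenvalues of $A^TA$.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def rotation(theta):
    """A 2-by-2 orthogonal matrix (plane rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def fro_norm(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def two_norm_2x2(A):
    """Largest singular value of a 2-by-2 matrix via eigenvalues of A^T A."""
    (a, b), (c, d) = A
    t = a * a + b * b + c * c + d * d      # trace(A^T A)
    det = a * d - b * c                    # det(A^T A) = det(A)^2
    disc = math.sqrt(max(t * t - 4.0 * det * det, 0.0))
    return math.sqrt((t + disc) / 2.0)

A = [[0.96, 1.72], [2.28, 0.96]]
Q, Z = rotation(0.7), rotation(-1.3)       # arbitrary orthogonal factors
QAZ = matmul(matmul(Q, A), Z)

print(fro_norm(A), fro_norm(QAZ))          # agree up to rounding
print(two_norm_2x2(A), two_norm_2x2(QAZ))  # agree up to rounding
```

Any other pair of angles should give the same agreement, since every plane rotation is orthogonal.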


2.5.3 The Singular Value Decomposition


The theory of norms developed in the previous two sections can be used to
prove the extremely useful singular value decomposition.


Theorem 2.5.2 (Singular Value Decomposition (SVD)) If A is a real
m-by-n matrix, then there exist orthogonal matrices

$$ U = [\,u_1,\ldots,u_m\,] \in \mathbb{R}^{m\times m} \quad\text{and}\quad V = [\,v_1,\ldots,v_n\,] \in \mathbb{R}^{n\times n} $$

such that

$$ U^TAV = \mathrm{diag}(\sigma_1,\ldots,\sigma_p) \in \mathbb{R}^{m\times n}, \qquad p = \min\{m,n\} $$

where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.


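Proof. Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ be unit 2-norm vectors that satisfy $Ax =
\sigma y$ with $\sigma = \|A\|_2$. From Theorem 2.5.1 there exist $V_2 \in \mathbb{R}^{n\times(n-1)}$ and
$U_2 \in \mathbb{R}^{m\times(m-1)}$ so $V = [\,x \;\; V_2\,] \in \mathbb{R}^{n\times n}$ and $U = [\,y \;\; U_2\,] \in \mathbb{R}^{m\times m}$ are
orthogonal. It is not hard to show that $U^TAV$ has the following structure:

$$ U^TAV = \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix} \equiv A_1. $$

Since

$$ \left\| A_1 \begin{bmatrix} \sigma \\ w \end{bmatrix} \right\|_2^2 \ge (\sigma^2 + w^Tw)^2 $$

we have $\|A_1\|_2^2 \ge (\sigma^2 + w^Tw)$. But $\sigma^2 = \|A\|_2^2 = \|A_1\|_2^2$, and so we
must have $w = 0$. An obvious induction argument completes the proof of
the theorem. □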


The $\sigma_i$ are the singular values of A and the vectors $u_i$ and $v_i$ are the
ith left singular vector and the ith right singular vector respectively. It
is easy to verify by comparing columns in the equations $AV = U\Sigma$ and
$A^TU = V\Sigma^T$ that

$$ Av_i = \sigma_i u_i, \qquad A^Tu_i = \sigma_i v_i, \qquad i = 1{:}\min\{m,n\}. $$

It is convenient to have the following notation for designating singular
values:

    $\sigma_i(A)$ = the ith largest singular value of A,
    $\sigma_{\max}(A)$ = the largest singular value of A,
    $\sigma_{\min}(A)$ = the smallest singular value of A.


The singular values of a matrix A are precisely the lengths of the semi-axes
of the hyperellipsoid E defined by $E = \{\, Ax : \|x\|_2 = 1 \,\}$.


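Example 2.5.1

$$ A = \begin{bmatrix} .96 & 1.72 \\ 2.28 & .96 \end{bmatrix} = U\Sigma V^T = \begin{bmatrix} .6 & -.8 \\ .8 & .6 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} .8 & .6 \\ .6 & -.8 \end{bmatrix}^T. $$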
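
The factorization in Example 2.5.1 can be multiplied out directly. This is a small sketch in plain Python; `matmul` and `transpose` are illustrative helpers, and the factors are entered with V equal to the 2-by-2 matrix with rows (.8, .6) and (.6, -.8), which is the reading of the example assumed here. It also checks the relation $Av_1 = \sigma_1u_1$ for the leading singular pair.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

U = [[0.6, -0.8], [0.8, 0.6]]
Sigma = [[3.0, 0.0], [0.0, 1.0]]
V = [[0.8, 0.6], [0.6, -0.8]]

A = matmul(matmul(U, Sigma), transpose(V))   # reassemble A = U Sigma V^T
UtU = matmul(transpose(U), U)                # should be the identity
VtV = matmul(transpose(V), V)                # should be the identity

v1 = [V[0][0], V[1][0]]                      # first right singular vector
Av1 = [sum(A[i][j] * v1[j] for j in range(2)) for i in range(2)]

print(A)      # close to [[0.96, 1.72], [2.28, 0.96]]
print(Av1)    # close to 3 * u1 = [1.8, 2.4]
```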


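The SVD reveals a great deal about the structure of a matrix. If the
SVD of A is given by Theorem 2.5.2, and we define r by

$$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0, $$

then

$$ \mathrm{rank}(A) = r \tag{2.5.3} $$
$$ \mathrm{null}(A) = \mathrm{span}\{v_{r+1},\ldots,v_n\} \tag{2.5.4} $$
$$ \mathrm{ran}(A) = \mathrm{span}\{u_1,\ldots,u_r\}, \tag{2.5.5} $$

and we have the SVD expansion

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T. \tag{2.5.6} $$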


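Various 2-norm and Frobenius norm properties have connections to the
SVD. If $A \in \mathbb{R}^{m\times n}$, then

$$ \|A\|_F^2 = \sigma_1^2 + \cdots + \sigma_p^2, \qquad p = \min\{m,n\} \tag{2.5.7} $$
$$ \|A\|_2 = \sigma_1 \tag{2.5.8} $$
$$ \min_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_n \qquad (m \ge n). \tag{2.5.9} $$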
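

2.5.4 The Thin SVD


If $A = U\Sigma V^T \in \mathbb{R}^{m\times n}$ is the SVD of A and $m \ge n$, then

$$ A = U_1\Sigma_1V^T $$

where

$$ U_1 = U(:,1{:}n) = [\,u_1,\ldots,u_n\,] \in \mathbb{R}^{m\times n} $$

and

$$ \Sigma_1 = \Sigma(1{:}n,1{:}n) = \mathrm{diag}(\sigma_1,\ldots,\sigma_n) \in \mathbb{R}^{n\times n}. $$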


We refer to this much-used, trimmed down version of the SVD as the thin 
SVD. 


2.5.5 Rank Deficiency and the SVD 


One of the most valuable aspects of the SVD is that it enables us to deal 
sensibly with the concept of matrix rank. Numerous theorems in linear 
algebra have the form "if such-and-such a matrix has full rank, then such- 
and-such a property holds." While neat and aesthetic, results of this flavor 
do not help us address the numerical difficulties frequently encountered in 
situations where near rank deficiency prevails. Rounding errors and fuzzy 
data make rank determination a nontrivial exercise. Indeed, for some small
$\epsilon$ we may be interested in the $\epsilon$-rank of a matrix which we define by

$$ \mathrm{rank}(A,\epsilon) = \min_{\|A-B\|_2 \le \epsilon} \mathrm{rank}(B). $$

Thus, if A is obtained in a laboratory with each $a_{ij}$ correct to within ±.001,
then it might make sense to look at rank(A, .001). Along the same lines, if
A is an m-by-n floating point matrix then it is reasonable to regard A as
numerically rank deficient if $\mathrm{rank}(A,\epsilon) < \min\{m,n\}$ with $\epsilon = \mathbf{u}\|A\|_2$.

Numerical rank deficiency and $\epsilon$-rank are nicely characterized in terms
of the SVD because the singular values indicate how near a given matrix is
to a matrix of lower rank.


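Theorem 2.5.3 Let the SVD of $A \in \mathbb{R}^{m\times n}$ be given by Theorem 2.5.2. If
$k < r = \mathrm{rank}(A)$ and

$$ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \tag{2.5.10} $$

then

$$ \min_{\mathrm{rank}(B)=k} \|A-B\|_2 = \|A-A_k\|_2 = \sigma_{k+1}. \tag{2.5.11} $$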
Proof. Since $U^TA_kV = \mathrm{diag}(\sigma_1,\ldots,\sigma_k,0,\ldots,0)$ it follows that $\mathrm{rank}(A_k) =
k$ and that $U^T(A-A_k)V = \mathrm{diag}(0,\ldots,0,\sigma_{k+1},\ldots,\sigma_p)$ and so $\|A-A_k\|_2 =
\sigma_{k+1}$.

Now suppose $\mathrm{rank}(B) = k$ for some $B \in \mathbb{R}^{m\times n}$. It follows that we can
find orthonormal vectors $x_1,\ldots,x_{n-k}$ so $\mathrm{null}(B) = \mathrm{span}\{x_1,\ldots,x_{n-k}\}$.
A dimension argument shows that

$$ \mathrm{span}\{x_1,\ldots,x_{n-k}\} \cap \mathrm{span}\{v_1,\ldots,v_{k+1}\} \ne \{0\}. $$

Let z be a unit 2-norm vector in this intersection. Since $Bz = 0$ and

$$ Az = \sum_{i=1}^{k+1} \sigma_i (v_i^Tz)\, u_i $$

we have

$$ \|A-B\|_2^2 \ge \|(A-B)z\|_2^2 = \|Az\|_2^2 = \sum_{i=1}^{k+1} \sigma_i^2 (v_i^Tz)^2 \ge \sigma_{k+1}^2 $$

completing the proof of the theorem. □

Theorem 2.5.3 says that the smallest singular value of A is the 2-norm
distance of A to the set of all rank-deficient matrices. It also follows that
the set of full rank matrices in $\mathbb{R}^{m\times n}$ is both open and dense.

Finally, if $r_\epsilon = \mathrm{rank}(A,\epsilon)$, then

$$ \sigma_1 \ge \cdots \ge \sigma_{r_\epsilon} > \epsilon \ge \sigma_{r_\epsilon+1} \ge \cdots \ge \sigma_p, \qquad p = \min\{m,n\}. $$

We have more to say about the numerical rank issue in §5.5 and §12.2.
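
As an illustration of Theorem 2.5.3 and the $\epsilon$-rank characterization, the sketch below (plain Python, 2-by-2 only) forms the rank-one truncation $A_1 = \sigma_1u_1v_1^T$ of the matrix from Example 2.5.1 and measures $\|A - A_1\|_2$, which should equal $\sigma_2 = 1$. The function names `two_norm_2x2` and `eps_rank` are hypothetical helpers; `eps_rank` simply counts the singular values that exceed $\epsilon$, which is what the chain of inequalities above implies.

```python
import math

def two_norm_2x2(A):
    """Largest singular value of a 2-by-2 matrix via eigenvalues of A^T A."""
    (a, b), (c, d) = A
    t = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(t * t - 4.0 * det * det, 0.0))
    return math.sqrt((t + disc) / 2.0)

def eps_rank(singular_values, eps):
    """rank(A, eps): the number of singular values strictly greater than eps."""
    return sum(1 for s in singular_values if s > eps)

sigma = [3.0, 1.0]                      # singular values from Example 2.5.1
u1, v1 = [0.6, 0.8], [0.8, 0.6]         # leading singular vectors
A = [[0.96, 1.72], [2.28, 0.96]]

# Rank-one truncation A_1 = sigma_1 * u1 * v1^T and the error A - A_1.
A1 = [[sigma[0] * u1[i] * v1[j] for j in range(2)] for i in range(2)]
E = [[A[i][j] - A1[i][j] for j in range(2)] for i in range(2)]
err = two_norm_2x2(E)

print(err)                              # close to sigma_2 = 1
print(eps_rank(sigma, 0.5), eps_rank(sigma, 1.5))
```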


2.5.6 Unitary Matrices 


Over the complex field the unitary matrices correspond to the orthogonal
matrices. In particular, $Q \in \mathbb{C}^{n\times n}$ is unitary if $Q^HQ = QQ^H = I_n$. Unitary
matrices preserve 2-norm. The SVD of a complex matrix involves unitary
matrices. If $A \in \mathbb{C}^{m\times n}$, then there exist unitary matrices $U \in \mathbb{C}^{m\times m}$ and
$V \in \mathbb{C}^{n\times n}$ such that

$$ U^HAV = \mathrm{diag}(\sigma_1,\ldots,\sigma_p) \in \mathbb{R}^{m\times n}, \qquad p = \min\{m,n\} $$

where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.


Problems 


P2.5.1 Show that if S is real and $S^T = -S$, then $I - S$ is nonsingular and the matrix
$(I - S)^{-1}(I + S)$ is orthogonal. This is known as the Cayley transform of S.

P2.5.2 Show that a triangular orthogonal matrix is diagonal.

P2.5.3 Show that if $Q = Q_1 + iQ_2$ is unitary with $Q_1, Q_2 \in \mathbb{R}^{n\times n}$, then the 2n-by-2n
real matrix

$$ Z = \begin{bmatrix} Q_1 & -Q_2 \\ Q_2 & Q_1 \end{bmatrix} $$

is orthogonal.

P2.5.4 Establish properties (2.5.3)-(2.5.9).

P2.5.5 Prove that

$$ \sigma_{\max}(A) = \max_{y \in \mathbb{R}^m,\ x \in \mathbb{R}^n} \frac{y^TAx}{\|x\|_2\,\|y\|_2}. $$

P2.5.6 For the 2-by-2 matrix $A = \begin{bmatrix} w & x \\ y & z \end{bmatrix}$, derive expressions for $\sigma_{\max}(A)$ and
$\sigma_{\min}(A)$ that are functions of w, x, y, and z.

P2.5.7 Show that any matrix in $\mathbb{R}^{m\times n}$ is the limit of a sequence of full rank matrices.

P2.5.8 Show that if $A \in \mathbb{R}^{m\times n}$ has rank n, then $\|A(A^TA)^{-1}A^T\|_2 = 1$.

P2.5.9 What is the nearest rank-one matrix to $A = \begin{bmatrix} 1 & M \\ 0 & 1 \end{bmatrix}$ in the Frobenius norm?

P2.5.10 Show that if $A \in \mathbb{R}^{m\times n}$ then $\|A\|_F \le \sqrt{\mathrm{rank}(A)}\,\|A\|_2$, thereby sharpening
(2.3.7).


Notes and References for Sec. 2.5 


Forsythe and Moler (1967) offer a good account of the SVD's role in the analysis of the
Ax = b problem. Their proof of the decomposition is more traditional than ours in that
it makes use of the eigenvalue theory for symmetric matrices. Historical SVD references
include

E. Beltrami (1873). "Sulle Funzioni Bilineari," Giornale di Matematiche 11, 98-106.
C. Eckart and G. Young (1939). "A Principal Axis Transformation for Non-Hermitian
Matrices," Bull. Amer. Math. Soc. 45, 118-21.
G.W. Stewart (1993). "On the Early History of the Singular Value Decomposition,"
SIAM Review 35, 551-566.

One of the most significant developments in scientific computation has been the increased
use of the SVD in application areas that require the intelligent handling of matrix rank.
The range of applications is impressive. One of the most interesting is

C.B. Moler and D. Morrison (1983). "Singular Value Analysis of Cryptograms," Amer.
Math. Monthly 90, 78-87.

For generalizations of the SVD to infinite dimensional Hilbert space, see

I.C. Gohberg and M.G. Krein (1969). Introduction to the Theory of Linear Non-Self
Adjoint Operators, Amer. Math. Soc., Providence, R.I.
F. Smithies (1970). Integral Equations, Cambridge University Press, Cambridge.

Reducing the rank of a matrix as in Theorem 2.5.3 when the perturbing matrix is
constrained is discussed in

J.W. Demmel (1987). "The smallest perturbation of a submatrix which lowers the rank
and constrained total least squares problems," SIAM J. Numer. Anal. 24, 199-206.
G.H. Golub, A. Hoffman, and G.W. Stewart (1983). "A Generalization of the Eckart-
Young-Mirsky Approximation Theorem," Lin. Alg. and Its Applic. 88/89, 317-328.
G.A. Watson (1988). "The Smallest Perturbation of a Submatrix which Lowers the Rank
of the Matrix," IMA J. Numer. Anal. 8, 295-304.


2.6 Projections and the CS Decomposition 


If the object of a computation is to compute a matrix or a vector, then 
norms are useful for assessing the accuracy of the answer or for measuring 
progress during an iteration. If the object of a computation is to compute 
a subspace, then to make similar comments we need to be able to quantify 
the distance between two subspaces. Orthogonal projections are critical in 
this regard. After the elementary concepts are established we discuss the 
CS decomposition. This is an SVD-like decomposition that is handy when 
having to compare a pair of subspaces. We begin with the notion of an 
orthogonal projection. 


2.6.1 Orthogonal Projections 


Let $S \subseteq \mathbb{R}^n$ be a subspace. $P \in \mathbb{R}^{n\times n}$ is the orthogonal projection onto
S if $\mathrm{ran}(P) = S$, $P^2 = P$, and $P^T = P$. From this definition it is easy to
show that if $x \in \mathbb{R}^n$, then $Px \in S$ and $(I - P)x \in S^\perp$.

If $P_1$ and $P_2$ are each orthogonal projections, then for any $z \in \mathbb{R}^n$ we
have

$$ \|(P_1 - P_2)z\|_2^2 = (P_1z)^T(I - P_2)z + (P_2z)^T(I - P_1)z. $$

If $\mathrm{ran}(P_1) = \mathrm{ran}(P_2) = S$, then the right-hand side of this expression is
zero showing that the orthogonal projection for a subspace is unique. If the
columns of $V = [\,v_1,\ldots,v_k\,]$ are an orthonormal basis for a subspace S, then
it is easy to show that $P = VV^T$ is the unique orthogonal projection onto
S. Note that if $v \in \mathbb{R}^n$, then $P = vv^T/v^Tv$ is the orthogonal projection
onto $S = \mathrm{span}\{v\}$.
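
The defining properties can be checked for the rank-one case $P = vv^T/v^Tv$. The sketch below is plain Python with illustrative helper names (`projector_onto_span`, `matvec`, `matmul`) and an arbitrarily chosen v and x; it confirms that P is symmetric and idempotent and that $(I - P)x$ is orthogonal to v.

```python
def projector_onto_span(v):
    """Orthogonal projection onto span{v}: P = v v^T / (v^T v)."""
    vtv = sum(t * t for t in v)
    return [[vi * vj / vtv for vj in v] for vi in v]

def matvec(P, x):
    return [sum(pij * xj for pij, xj in zip(row, x)) for row in P]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

v = [1.0, 2.0, 2.0]
x = [3.0, -1.0, 4.0]

P = projector_onto_span(v)
PP = matmul(P, P)                            # should reproduce P
Px = matvec(P, x)                            # component of x in span{v}
r = [xi - pi for xi, pi in zip(x, Px)]       # (I - P)x, lies in the complement
r_dot_v = sum(ri * vi for ri, vi in zip(r, v))

print(Px, r, r_dot_v)
```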


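2.6.2 SVD-Related Projections


There are several important orthogonal projections associated with the
singular value decomposition. Suppose $A = U\Sigma V^T \in \mathbb{R}^{m\times n}$ is the SVD of A
and that $r = \mathrm{rank}(A)$. If we have the U and V partitionings

$$ U = [\,U_r \;\; \tilde U_r\,], \qquad V = [\,V_r \;\; \tilde V_r\,], $$

where $U_r$ has r columns, $\tilde U_r$ has m-r columns, $V_r$ has r columns, and $\tilde V_r$ has
n-r columns, then

    $V_rV_r^T$ = projection onto $\mathrm{null}(A)^\perp = \mathrm{ran}(A^T)$
    $\tilde V_r\tilde V_r^T$ = projection onto $\mathrm{null}(A)$
    $U_rU_r^T$ = projection onto $\mathrm{ran}(A)$
    $\tilde U_r\tilde U_r^T$ = projection onto $\mathrm{ran}(A)^\perp = \mathrm{null}(A^T)$


2.6.3 Distance Between Subspaces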


The one-to-one correspondence between subspaces and orthogonal
projections enables us to devise a notion of distance between subspaces. Suppose
$S_1$ and $S_2$ are subspaces of $\mathbb{R}^n$ and that $\dim(S_1) = \dim(S_2)$. We define the
distance between these two spaces by

$$ \mathrm{dist}(S_1,S_2) = \|P_1 - P_2\|_2 \tag{2.6.1} $$

where $P_i$ is the orthogonal projection onto $S_i$. The distance between a
pair of subspaces can be characterized in terms of the blocks of a certain
orthogonal matrix.


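Theorem 2.6.1 Suppose

$$ W = [\,W_1 \;\; W_2\,], \qquad Z = [\,Z_1 \;\; Z_2\,] $$

are n-by-n orthogonal matrices, where $W_1$ and $Z_1$ have k columns and $W_2$ and
$Z_2$ have n-k columns. If $S_1 = \mathrm{ran}(W_1)$ and $S_2 = \mathrm{ran}(Z_1)$, then

$$ \mathrm{dist}(S_1,S_2) = \|W_1^TZ_2\|_2 = \|Z_1^TW_2\|_2. $$

Proof.

$$ \mathrm{dist}(S_1,S_2) = \|W_1W_1^T - Z_1Z_1^T\|_2 = \|W^T(W_1W_1^T - Z_1Z_1^T)Z\|_2 = \left\| \begin{bmatrix} 0 & W_1^TZ_2 \\ -W_2^TZ_1 & 0 \end{bmatrix} \right\|_2. $$

Note that the matrices $W_2^TZ_1$ and $W_1^TZ_2$ are submatrices of the orthogonal
matrix

$$ Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} = \begin{bmatrix} W_1^TZ_1 & W_1^TZ_2 \\ W_2^TZ_1 & W_2^TZ_2 \end{bmatrix} = W^TZ. $$

Our goal is to show that $\|Q_{21}\|_2 = \|Q_{12}\|_2$. Since Q is orthogonal it
follows from

$$ Q \begin{bmatrix} x \\ 0 \end{bmatrix} = \begin{bmatrix} Q_{11}x \\ Q_{21}x \end{bmatrix} $$

that $1 = \|Q_{11}x\|_2^2 + \|Q_{21}x\|_2^2$ for all unit 2-norm $x \in \mathbb{R}^k$. Thus,

$$ \|Q_{21}\|_2^2 = \max_{\|x\|_2=1} \|Q_{21}x\|_2^2 = 1 - \min_{\|x\|_2=1} \|Q_{11}x\|_2^2 = 1 - \sigma_{\min}(Q_{11})^2. $$

Analogously, by working with $Q^T$ (which is also orthogonal) it is possible
to show that

$$ \|Q_{12}^T\|_2^2 = 1 - \sigma_{\min}(Q_{11})^2 $$

and therefore

$$ \|Q_{12}\|_2^2 = 1 - \sigma_{\min}(Q_{11})^2. $$

Thus, $\|Q_{21}\|_2 = \|Q_{12}\|_2$. □

Note that if $S_1$ and $S_2$ are subspaces in $\mathbb{R}^n$ with the same dimension, then

$$ 0 \le \mathrm{dist}(S_1,S_2) \le 1. $$

The distance is zero if $S_1 = S_2$ and one if $S_1 \cap S_2^\perp \ne \{0\}$.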
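
For two lines in the plane the definition (2.6.1) and Theorem 2.6.1 can be compared directly. The sketch below is plain Python restricted to one-dimensional subspaces of $\mathbb{R}^2$ spanned by unit vectors; the helper names and the angle 0.4 are illustrative assumptions. Because $P_1 - P_2$ is symmetric, its 2-norm is the largest eigenvalue magnitude, computed here in closed form.

```python
import math

def projector(x):
    """Orthogonal projection onto span{x} for a unit 2-norm vector x."""
    return [[xi * xj for xj in x] for xi in x]

def sym_two_norm_2x2(D):
    """2-norm of a symmetric 2-by-2 matrix: largest |eigenvalue|."""
    a, b, d = D[0][0], D[0][1], D[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return abs(mean) + radius

def dist_lines(x, y):
    """dist(span{x}, span{y}) = || P1 - P2 ||_2 for unit vectors x, y."""
    P1, P2 = projector(x), projector(y)
    D = [[P1[i][j] - P2[i][j] for j in range(2)] for i in range(2)]
    return sym_two_norm_2x2(D)

theta = 0.4
x = [1.0, 0.0]
y = [math.cos(theta), math.sin(theta)]
y_perp = [-math.sin(theta), math.cos(theta)]   # completes y to an orthogonal Z

d = dist_lines(x, y)
block = abs(x[0] * y_perp[0] + x[1] * y_perp[1])   # |W1^T Z2| from Theorem 2.6.1

print(d, block, math.sin(theta))   # all three agree up to rounding
```

The same helper returns 0 for identical lines and 1 for perpendicular ones, matching the two extreme cases noted above.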

A more refined analysis of the blocks of the Q matrix above sheds more
light on the difference between a pair of subspaces. This requires a special
SVD-like decomposition for orthogonal matrices.


2.6.4 The CS Decomposition


The blocks of an orthogonal matrix partitioned into 2-by-2 form have highly 
related SVDs. This is the gist of the CS decomposition. We prove a very 
useful special case first. 


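Theorem 2.6.2 (The CS Decomposition (Thin Version)) Consider the
matrix

$$ Q = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}, \qquad Q_1 \in \mathbb{R}^{m_1\times n}, \quad Q_2 \in \mathbb{R}^{m_2\times n} $$

where $m_1 \ge n$ and $m_2 \ge n$. If the columns of Q are orthonormal, then there
exist orthogonal matrices $U_1 \in \mathbb{R}^{m_1\times m_1}$, $U_2 \in \mathbb{R}^{m_2\times m_2}$, and $V_1 \in \mathbb{R}^{n\times n}$ such
that

$$ \begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix}^T \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} V_1 = \begin{bmatrix} C \\ S \end{bmatrix} $$

where

$$ C = \mathrm{diag}(\cos(\theta_1),\ldots,\cos(\theta_n)), $$
$$ S = \mathrm{diag}(\sin(\theta_1),\ldots,\sin(\theta_n)), $$

and

$$ 0 \le \theta_1 \le \theta_2 \le \cdots \le \theta_n \le \frac{\pi}{2}. $$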


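Proof. Since $\|Q_1\|_2 \le \|Q\|_2 = 1$, the singular values of $Q_1$ are all in
the interval [0,1]. Let

$$ U_1^TQ_1V_1 = C = \mathrm{diag}(c_1,\ldots,c_n) = \begin{bmatrix} I_t & 0 \\ 0 & \Sigma \end{bmatrix} $$

(with diagonal blocks of order t and n-t) be the SVD of $Q_1$ where we assume

$$ 1 = c_1 = \cdots = c_t > c_{t+1} \ge \cdots \ge c_n \ge 0. $$

To complete the proof of the theorem we must construct the orthogonal
matrix $U_2$. If

$$ Q_2V_1 = [\,W_1 \;\; W_2\,] $$

where $W_1$ has t columns and $W_2$ has n-t columns, then

$$ \begin{bmatrix} U_1 & 0 \\ 0 & I_{m_2} \end{bmatrix}^T \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} V_1 = \begin{bmatrix} I_t & 0 \\ 0 & \Sigma \\ W_1 & W_2 \end{bmatrix}. $$

Since the columns of this matrix have unit 2-norm, $W_1 = 0$. The columns
of $W_2$ are nonzero and mutually orthogonal because

$$ W_2^TW_2 = I_{n-t} - \Sigma^T\Sigma = \mathrm{diag}(1 - c_{t+1}^2,\ldots,1 - c_n^2) $$

is nonsingular. If $s_k = \sqrt{1 - c_k^2}$ for k = 1:n, then the columns of

$$ Z = W_2\,\mathrm{diag}(1/s_{t+1},\ldots,1/s_n) $$

are orthonormal. By Theorem 2.5.1 there exists an orthogonal matrix
$U_2 \in \mathbb{R}^{m_2\times m_2}$ with $U_2(:,t+1{:}n) = Z$. It is easy to verify that

$$ U_2^TQ_2V_1 = \mathrm{diag}(s_1,\ldots,s_n) \equiv S. $$

Since $c_k^2 + s_k^2 = 1$ for k = 1:n, it follows that these quantities are the required
cosines and sines. □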


Using the same sort of techniques it is possible to prove the following more 
general version of the decomposition: 


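Theorem 2.6.3 (CS Decomposition (General Version)) If

$$ Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} $$

is a 2-by-2 (arbitrary) partitioning of an n-by-n orthogonal matrix, then
there exist orthogonal

$$ U = \begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix} \quad\text{and}\quad V = \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix} $$

such that

$$ U^TQV = \left[\begin{array}{ccc|ccc} I & 0 & 0 & 0 & 0 & 0 \\ 0 & C & 0 & 0 & S & 0 \\ 0 & 0 & 0 & 0 & 0 & I \\ \hline 0 & 0 & 0 & I & 0 & 0 \\ 0 & S & 0 & 0 & -C & 0 \\ 0 & 0 & I & 0 & 0 & 0 \end{array}\right] $$

where $C = \mathrm{diag}(c_1,\ldots,c_p)$ and $S = \mathrm{diag}(s_1,\ldots,s_p)$ are square diagonal
matrices with $0 < c_i, s_i < 1$.


Proof. See Paige and Saunders (1981) for details. We have suppressed the
dimensions of the zero submatrices, some of which may be empty. □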


The essential message of the decomposition is that the SVDs of the $Q_{ij}$ are
highly related.


Example 2.6.1 The matrix


-0.7576 0.3697 0.3838 0.2126 -0.3112
-0.4077 -0.1552 | -0.1129 0.2676 0.8517


The angles associated with the cosines and sines turn out to be very
important in a number of applications. See §12.4.


Problems


P2.6.1 Show that if P is an orthogonal projection, then $Q = I - 2P$ is orthogonal.

P2.6.2 What are the singular values of an orthogonal projection?

P2.6.3 Suppose $S_1 = \mathrm{span}\{x\}$ and $S_2 = \mathrm{span}\{y\}$, where x and y are unit 2-norm
vectors in $\mathbb{R}^2$. Working only with the definition of dist(·,·), show that $\mathrm{dist}(S_1,S_2) =
\sqrt{1 - (x^Ty)^2}$, verifying that the distance between $S_1$ and $S_2$ equals the sine of the angle
between x and y.


Notes and References for Sec. 2.6

The following papers discuss various aspects of the CS decomposition:

C. Davis and W. Kahan (1970). "The Rotation of Eigenvectors by a Perturbation III,"
SIAM J. Num. Anal. 7, 1-46.
G.W. Stewart (1977). "On the Perturbation of Pseudo-Inverses, Projections and Linear
Least Squares Problems," SIAM Review 19, 634-662.
C.C. Paige and M. Saunders (1981). "Toward a Generalized Singular Value
Decomposition," SIAM J. Num. Anal. 18, 398-403.
C.C. Paige and M. Wei (1994). "History and Generality of the CS Decomposition," Lin.
Alg. and Its Applic. 208/209, 303-328.

See §8.7 for some computational details.
For a deeper geometrical understanding of the CS decomposition and the notion of
distance between subspaces, see

T.A. Arias, A. Edelman, and S. Smith (1996). "Conjugate Gradient and Newton's
Method on the Grassman and Stiefel Manifolds," to appear in SIAM J. Matrix Anal.
Appl.


2.7 The Sensitivity of Square Systems 


We now use some of the tools developed in previous sections to analyze the
linear system problem Ax = b where $A \in \mathbb{R}^{n\times n}$ is nonsingular and $b \in \mathbb{R}^n$.
Our aim is to examine how perturbations in A and b affect the solution x.
A much more detailed treatment may be found in Higham (1996).


2.7.1 An SVD Analysis


If

$$ A = \sum_{i=1}^{n} \sigma_i u_i v_i^T = U\Sigma V^T $$

is the SVD of A, then

$$ x = A^{-1}b = (U\Sigma V^T)^{-1}b = \sum_{i=1}^{n} \frac{u_i^Tb}{\sigma_i}\, v_i. \tag{2.7.1} $$

This expansion shows that small changes in A or b can induce relatively
large changes in x if $\sigma_n$ is small.

It should come as no surprise that the magnitude of $\sigma_n$ should have
a bearing on the sensitivity of the Ax = b problem when we recall from
Theorem 2.5.3 that $\sigma_n$ is the distance from A to the set of singular matrices.
As the matrix of coefficients approaches this set, it is intuitively clear that
the solution x should be increasingly sensitive to perturbations.


2.7.2 Condition 


A precise measure of linear system sensitivity can be obtained by
considering the parameterized system

$$ (A + \epsilon F)x(\epsilon) = b + \epsilon f, \qquad x(0) = x $$

where $F \in \mathbb{R}^{n\times n}$ and $f \in \mathbb{R}^n$. If A is nonsingular, then it is clear that $x(\epsilon)$
is differentiable in a neighborhood of zero. Moreover, $\dot{x}(0) = A^{-1}(f - Fx)$
and thus, the Taylor series expansion for $x(\epsilon)$ has the form

$$ x(\epsilon) = x + \epsilon\dot{x}(0) + O(\epsilon^2). $$

Using any vector norm and consistent matrix norm we obtain

$$ \frac{\|x(\epsilon) - x\|}{\|x\|} \le |\epsilon|\,\|A^{-1}\| \left\{ \frac{\|f\|}{\|x\|} + \|F\| \right\} + O(\epsilon^2). \tag{2.7.2} $$


For square matrices A define the condition number $\kappa(A)$ by

$$ \kappa(A) = \|A\|\,\|A^{-1}\| \tag{2.7.3} $$

with the convention that $\kappa(A) = \infty$ for singular A. Using the inequality
$\|b\| \le \|A\|\,\|x\|$ it follows from (2.7.2) that

$$ \frac{\|x(\epsilon) - x\|}{\|x\|} \le \kappa(A)(\rho_A + \rho_b) + O(\epsilon^2) \tag{2.7.4} $$

where

$$ \rho_A = |\epsilon|\,\frac{\|F\|}{\|A\|} \quad\text{and}\quad \rho_b = |\epsilon|\,\frac{\|f\|}{\|b\|} $$

represent the relative errors in A and b, respectively. Thus, the relative
error in x can be $\kappa(A)$ times the relative error in A and b. In this sense, the
condition number $\kappa(A)$ quantifies the sensitivity of the Ax = b problem.

Note that $\kappa(\cdot)$ depends on the underlying norm and subscripts are used
accordingly, e.g.,

$$ \kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2 = \frac{\sigma_1(A)}{\sigma_n(A)}. \tag{2.7.5} $$

Thus, the 2-norm condition of a matrix A measures the elongation of the
hyperellipsoid $\{Ax : \|x\|_2 = 1\}$.

We mention two other characterizations of the condition number. For
p-norm condition numbers, we have

$$ \frac{1}{\kappa_p(A)} = \min_{A+\Delta A \text{ singular}} \frac{\|\Delta A\|_p}{\|A\|_p}. \tag{2.7.6} $$

This result may be found in Kahan (1966) and shows that $\kappa_p(A)$ measures
the relative p-norm distance from A to the set of singular matrices.

For any norm, we also have

$$ \kappa(A) = \lim_{\epsilon \to 0}\ \sup_{\|\Delta A\| \le \epsilon\|A\|} \frac{\|(A + \Delta A)^{-1} - A^{-1}\|}{\epsilon}\,\frac{1}{\|A^{-1}\|}. \tag{2.7.7} $$

This imposing result merely says that the condition number is a normalized
Frechet derivative of the map $A \to A^{-1}$. Further details may be found in
Rice (1966b). Recall that we were initially led to $\kappa(A)$ through
differentiation.

If $\kappa(A)$ is large, then A is said to be an ill-conditioned matrix. Note that
this is a norm-dependent property.¹ However, any two condition numbers
$\kappa_\alpha(\cdot)$ and $\kappa_\beta(\cdot)$ on $\mathbb{R}^{n\times n}$ are equivalent in that constants $c_1$ and $c_2$ can be
found for which

$$ c_1\kappa_\alpha(A) \le \kappa_\beta(A) \le c_2\kappa_\alpha(A), \qquad A \in \mathbb{R}^{n\times n}. $$

For example, on $\mathbb{R}^{n\times n}$ we have

$$ \frac{1}{n}\,\kappa_2(A) \le \kappa_1(A) \le n\,\kappa_2(A) $$
$$ \frac{1}{n}\,\kappa_\infty(A) \le \kappa_2(A) \le n\,\kappa_\infty(A) \tag{2.7.8} $$
$$ \frac{1}{n^2}\,\kappa_1(A) \le \kappa_\infty(A) \le n^2\,\kappa_1(A). $$

Thus, if a matrix is ill-conditioned in the $\alpha$-norm, it is ill-conditioned in
the $\beta$-norm modulo the constants $c_1$ and $c_2$ above.

For any of the p-norms, we have $\kappa_p(A) \ge 1$. Matrices with small
condition numbers are said to be well-conditioned. In the 2-norm, orthogonal
matrices are perfectly conditioned in that $\kappa_2(Q) = 1$ if Q is orthogonal.


2.7.3 Determinants and Nearness to Singularity 


It is natural to consider how well determinant size measures ill-conditioning.
If det(A) = 0 is equivalent to singularity, is det(A) ≈ 0 equivalent to near
singularity? Unfortunately, there is little correlation between det(A) and
the condition of Ax = b. For example, the matrix $B_n$ defined by

$$ B_n = \begin{bmatrix} 1 & -1 & \cdots & -1 \\ 0 & 1 & \cdots & -1 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{n\times n} \tag{2.7.9} $$

has determinant 1, but $\kappa_\infty(B_n) = n2^{n-1}$. On the other hand, a very well
conditioned matrix can have a very small determinant. For example,

$$ D_n = \mathrm{diag}(10^{-1},\ldots,10^{-1}) \in \mathbb{R}^{n\times n} $$

satisfies $\kappa_p(D_n) = 1$ although $\det(D_n) = 10^{-n}$.
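
Both examples can be reproduced exactly in integer and rational arithmetic. The following is a sketch in plain Python; `build_B`, `inverse_unit_upper`, and `inf_norm` are hypothetical helper names, and n = 10 is an arbitrary choice. The inverse of $B_n$ is obtained by back substitution on the columns of the identity, which stays in the integers because the diagonal of $B_n$ is all ones.

```python
from fractions import Fraction

def build_B(n):
    """The matrix (2.7.9): ones on the diagonal, -1 above it, 0 below."""
    return [[1 if i == j else (-1 if j > i else 0) for j in range(n)]
            for i in range(n)]

def inverse_unit_upper(U):
    """Invert a unit upper triangular matrix column by column (back substitution)."""
    n = len(U)
    X = [[0] * n for _ in range(n)]
    for col in range(n):
        for i in range(n - 1, -1, -1):
            rhs = 1 if i == col else 0
            X[i][col] = rhs - sum(U[i][k] * X[k][col] for k in range(i + 1, n))
    return X

def inf_norm(A):
    """Matrix infinity norm: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

n = 10
B = build_B(n)
det_B = 1
for i in range(n):
    det_B *= B[i][i]                       # triangular: product of the diagonal
kappa_B = inf_norm(B) * inf_norm(inverse_unit_upper(B))

# D_n = diag(1/10, ..., 1/10): tiny determinant, perfect conditioning.
d = Fraction(1, 10)
det_D = d ** n
kappa_D = abs(d) * abs(1 / d)

print(det_B, kappa_B)      # determinant 1, condition n * 2**(n-1)
print(det_D, kappa_D)
```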


2.7.4 A Rigorous Norm Bound 


Recall that the derivation of (2.7.4) was valuable because it highlighted the
connection between $\kappa(A)$ and the rate of change of $x(\epsilon)$ at $\epsilon = 0$. However,
it is a little unsatisfying because it is contingent on $\epsilon$ being "small enough"
and because it sheds no light on the size of the $O(\epsilon^2)$ term. In this and the
next subsection we develop some additional Ax = b perturbation theorems
that are completely rigorous.

¹ It also depends upon the definition of "large." The matter is pursued in §3.5.

We first establish a useful lemma that indicates in terms of $\kappa(A)$ when
we can expect a perturbed system to be nonsingular.


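Lemma 2.7.1 Suppose

$$ Ax = b, \qquad A \in \mathbb{R}^{n\times n},\ 0 \ne b \in \mathbb{R}^n $$
$$ (A + \Delta A)y = b + \Delta b, \qquad \Delta A \in \mathbb{R}^{n\times n},\ \Delta b \in \mathbb{R}^n $$

with $\|\Delta A\| \le \epsilon\|A\|$ and $\|\Delta b\| \le \epsilon\|b\|$. If $\epsilon\kappa(A) = r < 1$, then $A + \Delta A$
is nonsingular and

$$ \frac{\|y\|}{\|x\|} \le \frac{1+r}{1-r}. $$

Proof. Since $\|A^{-1}\Delta A\| \le \epsilon\|A^{-1}\|\,\|A\| = r < 1$ it follows from
Theorem 2.3.4 that $(A + \Delta A)$ is nonsingular. Using Lemma 2.3.3 and the
equality $(I + A^{-1}\Delta A)y = x + A^{-1}\Delta b$ we find

$$ \|y\| \le \|(I + A^{-1}\Delta A)^{-1}\| \left( \|x\| + \epsilon\|A^{-1}\|\,\|b\| \right) \le \frac{1}{1-r}\left( \|x\| + \epsilon\|A^{-1}\|\,\|b\| \right) = \frac{1}{1-r}\left( \|x\| + r\,\frac{\|b\|}{\|A\|} \right). $$

Since $\|b\| = \|Ax\| \le \|A\|\,\|x\|$ it follows that

$$ \|y\| \le \frac{1}{1-r}\left( \|x\| + r\|x\| \right). \ \Box $$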

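We are now set to establish a rigorous Ax = b perturbation bound.


Theorem 2.7.2 If the conditions of Lemma 2.7.1 hold, then

$$ \frac{\|y - x\|}{\|x\|} \le \frac{2\epsilon}{1-r}\,\kappa(A). \tag{2.7.10} $$

Proof. Since

$$ y - x = A^{-1}\Delta b - A^{-1}\Delta A\,y \tag{2.7.11} $$

we have $\|y - x\| \le \epsilon\|A^{-1}\|\,\|b\| + \epsilon\|A^{-1}\|\,\|A\|\,\|y\|$ and so

$$ \frac{\|y - x\|}{\|x\|} \le \epsilon\kappa(A)\,\frac{\|b\|}{\|A\|\,\|x\|} + \epsilon\kappa(A)\,\frac{\|y\|}{\|x\|} \le \epsilon\kappa(A)\left( 1 + \frac{1+r}{1-r} \right) = \frac{2\epsilon}{1-r}\,\kappa(A). \ \Box $$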

Example 2.7.1 The Ax = b problem

$$ \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 10^{-6} \end{bmatrix} $$

has solution $x = (1, 1)^T$ and condition $\kappa_\infty(A) = 10^6$. If $\Delta b = (10^{-6}, 0)^T$, $\Delta A = 0$,
and $(A + \Delta A)y = b + \Delta b$, then $y = (1 + 10^{-6}, 1)^T$ and the inequality (2.7.10) says

$$ 10^{-6} = \frac{\|x - y\|_\infty}{\|x\|_\infty} \ll \frac{\|\Delta b\|_\infty}{\|b\|_\infty}\,\kappa_\infty(A) = 10^{-6}\,10^{6} = 1. $$

Thus, the upper bound in (2.7.10) can be a gross overestimate of the error induced by the
perturbation. On the other hand, if $\Delta b = (0, 10^{-6})^T$, $\Delta A = 0$, and $(A + \Delta A)y = b + \Delta b$,
then this inequality says

$$ \frac{10^{0}}{10^{0}} = \frac{\|x - y\|_\infty}{\|x\|_\infty} \le 2 \times 10^{-6}\,10^{6}. $$

Thus, there are perturbations for which the bound in (2.7.10) is essentially attained.
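
The two perturbations in Example 2.7.1 can be replayed numerically. This is a sketch in plain Python under the assumption, consistent with the stated solution and condition number, that A = diag(1, 10^-6) and b = (1, 10^-6); the helper names are illustrative. For each perturbation it records the observed relative error next to the product of the relative perturbation and the condition number.

```python
def solve_diagonal(d, b):
    """Solve diag(d) x = b."""
    return [bi / di for di, bi in zip(d, b)]

def inf_norm(v):
    return max(abs(t) for t in v)

def relative_error(x, y):
    return inf_norm([xi - yi for xi, yi in zip(x, y)]) / inf_norm(x)

d = [1.0, 1e-6]                 # diagonal of A (assumed form of the example)
b = [1.0, 1e-6]
x = solve_diagonal(d, b)        # exact solution (1, 1)

# Infinity-norm condition number of a diagonal matrix.
kappa_inf = max(abs(t) for t in d) * max(1.0 / abs(t) for t in d)

results = []
for db in ([1e-6, 0.0], [0.0, 1e-6]):
    y = solve_diagonal(d, [bi + dbi for bi, dbi in zip(b, db)])
    eps = inf_norm(db) / inf_norm(b)          # relative size of the perturbation
    results.append((relative_error(x, y), eps * kappa_inf))

for observed, scaled in results:
    print(observed, scaled)
```

The first perturbation produces an error far below the scaled quantity, while the second produces an error of the same order, which is the contrast the example draws.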


2.7.5 Some Rigorous Componentwise Bounds 


We conclude this section by showing that a more refined perturbation the- 
ory is possible if componentwise perturbation bounds are in effect and if 
we make use of the absolute value notation. 


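Theorem 2.7.3 Suppose

$$ Ax = b, \qquad A \in \mathbb{R}^{n\times n},\ 0 \ne b \in \mathbb{R}^n $$
$$ (A + \Delta A)y = b + \Delta b, \qquad \Delta A \in \mathbb{R}^{n\times n},\ \Delta b \in \mathbb{R}^n $$

and that $|\Delta A| \le \epsilon|A|$ and $|\Delta b| \le \epsilon|b|$. If $\epsilon\kappa_\infty(A) = r < 1$, then $(A + \Delta A)$
is nonsingular and

$$ \frac{\|y - x\|_\infty}{\|x\|_\infty} \le \frac{2\epsilon}{1-r}\,\|\,|A^{-1}|\,|A|\,\|_\infty. $$

Proof. Since $\|\Delta A\|_\infty \le \epsilon\|A\|_\infty$ and $\|\Delta b\|_\infty \le \epsilon\|b\|_\infty$ the conditions of
Lemma 2.7.1 are satisfied in the infinity norm. This implies that $A + \Delta A$
is nonsingular and

$$ \frac{\|y\|_\infty}{\|x\|_\infty} \le \frac{1+r}{1-r}. $$

Now using (2.7.11) we find

$$ |y - x| \le |A^{-1}|\,|\Delta b| + |A^{-1}|\,|\Delta A|\,|y| \le \epsilon|A^{-1}|\,|b| + \epsilon|A^{-1}|\,|A|\,|y| \le \epsilon|A^{-1}|\,|A|\,(|x| + |y|). $$

If we take norms, then

$$ \|y - x\|_\infty \le \epsilon\,\|\,|A^{-1}|\,|A|\,\|_\infty \left( \|x\|_\infty + \frac{1+r}{1-r}\,\|x\|_\infty \right). $$

The theorem follows upon division by $\|x\|_\infty$. □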


We refer to the quantity $\|\,|A^{-1}|\,|A|\,\|_\infty$ as the Skeel condition number. It
has been effectively used in the analysis of several important linear system
computations. See §3.5.

Lastly, we report on the results of Oettli and Prager (1964) that indicate
when an approximate solution $\hat{x} \in \mathbb{R}^n$ to the n-by-n system Ax = b
satisfies a perturbed system with prescribed structure. In particular, suppose
$E \in \mathbb{R}^{n\times n}$ and $f \in \mathbb{R}^n$ are given and have nonnegative entries. We seek
$\Delta A \in \mathbb{R}^{n\times n}$, $\Delta b \in \mathbb{R}^n$, and $\omega \ge 0$ such that

$$ (A + \Delta A)\hat{x} = b + \Delta b, \qquad |\Delta A| \le \omega E, \quad |\Delta b| \le \omega f. \tag{2.7.12} $$

Note that by properly choosing E and f the perturbed system can take on
certain qualities. For example, if $E = |A|$ and $f = |b|$ and $\omega$ is small, then
$\hat{x}$ satisfies a nearby system in the componentwise sense. Oettli and Prager
(1964) show that for a given A, b, $\hat{x}$, E, and f the smallest $\omega$ possible in
(2.7.12) is given by

$$ \omega_{\min} = \max_{1 \le i \le n} \frac{|A\hat{x} - b|_i}{(E|\hat{x}| + f)_i}. $$

If $A\hat{x} = b$ then $\omega_{\min} = 0$. On the other hand, if $\omega_{\min} = \infty$, then $\hat{x}$ does
not satisfy any system of the prescribed perturbation structure.
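
The Oettli-Prager quantity is simple to evaluate row by row. The sketch below is plain Python; the function name `omega_min` and the small test system are illustrative. One convention is assumed here that the text does not spell out: a ratio 0/0 is treated as 0, and a nonzero residual over a zero denominator gives infinity.

```python
import math

def omega_min(A, b, xhat, E, f):
    """Smallest omega in (2.7.12): max_i |A xhat - b|_i / (E |xhat| + f)_i.

    Assumed convention: 0/0 counts as 0, and nonzero/0 counts as infinity.
    """
    worst = 0.0
    for i in range(len(b)):
        resid = abs(sum(A[i][j] * xhat[j] for j in range(len(xhat))) - b[i])
        denom = sum(E[i][j] * abs(xhat[j]) for j in range(len(xhat))) + f[i]
        if resid == 0.0:
            continue                      # this row imposes no constraint
        if denom == 0.0:
            return math.inf               # no admissible perturbation exists
        worst = max(worst, resid / denom)
    return worst

A = [[2.0, 0.0], [0.0, 4.0]]
b = [2.0, 4.0]
E = [[abs(a) for a in row] for row in A]  # componentwise choice E = |A|
f = [abs(t) for t in b]                   # componentwise choice f = |b|

w_exact = omega_min(A, b, [1.0, 1.0], E, f)     # exact solution
w_near = omega_min(A, b, [1.01, 1.0], E, f)     # slightly perturbed solution
w_none = omega_min(A, b, [1.01, 1.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])

print(w_exact, w_near, w_none)
```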


Problems 


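P2.7.1 Show that if $\|I\| \ge 1$, then $\kappa(A) \ge 1$.

P2.7.2 Show that for a given norm, $\kappa(AB) \le \kappa(A)\kappa(B)$ and that $\kappa(\alpha A) = \kappa(A)$ for all
nonzero $\alpha$.

P2.7.3 Relate the 2-norm condition of $X \in \mathbb{R}^{m\times n}$ ($m \ge n$) to the 2-norm condition of
the matrices

$$ B = \begin{bmatrix} I_m & X \\ 0 & I_n \end{bmatrix} \quad\text{and}\quad C = \begin{bmatrix} X \\ I_n \end{bmatrix}. $$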


Notes and References for Sec. 2.7

The condition concept is thoroughly investigated in

J. Rice (1966). "A Theory of Condition," SIAM J. Num. Anal. 3, 287-310.
W. Kahan (1966). "Numerical Linear Algebra," Canadian Math. Bull. 9, 757-801.


References for componentwise perturbation theory include 


Chapter 3


General Linear Systems


§3.1 Triangular Systems
§3.2 The LU Factorization
§3.3 Roundoff Analysis of Gaussian Elimination
§3.4 Pivoting
§3.5 Improving and Estimating Accuracy


The problem of solving a linear system Ax = b is central in scientific
computation. In this chapter we focus on the method of Gaussian
elimination, the algorithm of choice when A is square, dense, and unstructured.
When A does not fall into this category, then the algorithms of Chapters
4, 5, and 10 are of interest. Some parallel Ax = b solvers are discussed in
Chapter 6.

We motivate the method of Gaussian elimination in §3.1 by discussing
the ease with which triangular systems can be solved. The conversion of
a general system to triangular form via Gauss transformations is then
presented in §3.2 where the "language" of matrix factorizations is introduced.
Unfortunately, the derived method behaves very poorly on a nontrivial class
of problems. Our error analysis in §3.3 pinpoints the difficulty and
motivates §3.4, where the concept of pivoting is introduced. In the final section
we comment upon the important practical issues associated with scaling,
iterative improvement, and condition estimation.


Before You Begin 


Chapter 1, §§2.1-2.5, and §2.7 are assumed. Complementary references
include Forsythe and Moler (1967), Stewart (1973), Hager (1988), Watkins
(1991), Ciarlet (1992), Datta (1995), Higham (1996), Trefethen and Bau
(1996), and Demmel (1996). Some MATLAB functions important to this
chapter are lu, cond, rcond, and the "backslash" operator "\". LAPACK
connections include


Solves $Ax = b$
Solves $AX = B$
Condition estimate
Solve $AX = B$, $A^TX = B$ with error bounds
Solve $AX = B$, $A^TX = B$


Condition estimate via $PA = LU$
Improve $AX = B$, $A^TX = B$, $A^HX = B$ solutions with error bounds
Solve $AX = B$, $A^TX = B$, $A^HX = B$ with condition estimate
$PA = LU$
Solve $AX = B$, $A^TX = B$, $A^HX = B$ via $PA = LU$
$A^{-1}$
Equilibration


3.1 Triangular Systems 
Traditional factorization methods for linear systems involve the conversion 
of the given square system to a triangular system that has the same solution. 
This section is about the solution of triangular systems. 
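

3.1.1 Forward Substitution


Consider the following 2-by-2 lower triangular system:

$$ \begin{bmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}. $$

If $\ell_{11}\ell_{22} \ne 0$, then the unknowns can be determined sequentially:

$$ x_1 = b_1/\ell_{11}, \qquad x_2 = (b_2 - \ell_{21}x_1)/\ell_{22}. $$

This is the 2-by-2 version of an algorithm known as forward substitution.
The general procedure is obtained by solving the ith equation in Lx = b
for $x_i$:

$$ x_i = \left( b_i - \sum_{j=1}^{i-1} \ell_{ij}x_j \right) \Big/ \ell_{ii}. $$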



If this is evaluated for i = 1:n, then a complete specification of x is obtained.
Note that at the ith stage the dot product of L(i,1:i-1) and x(1:i-1) is
required. Since $b_i$ only is involved in the formula for $x_i$, the former may be
overwritten by the latter:


Algorithm 3.1.1 (Forward Substitution: Row Version) If $L \in \mathbb{R}^{n\times n}$
is lower triangular and $b \in \mathbb{R}^n$, then this algorithm overwrites b with the
solution to Lx = b. L is assumed to be nonsingular.

    b(1) = b(1)/L(1,1)
    for i = 2:n
        b(i) = (b(i) - L(i,1:i-1)b(1:i-1))/L(i,i)
    end

This algorithm requires $n^2$ flops. Note that L is accessed by row. The
computed solution $\hat{x}$ satisfies:

$$ (L + F)\hat{x} = b, \qquad |F| \le n\mathbf{u}|L| + O(\mathbf{u}^2). \tag{3.1.1} $$

For a proof, see Higham (1996). It says that the computed solution exactly
satisfies a slightly perturbed system. Moreover, each entry in the perturbing
matrix F is small relative to the corresponding element of L.
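
Algorithm 3.1.1 translates almost line for line into Python. The following is a minimal sketch using lists of rows and 0-based indexing; unlike the algorithm it returns a new list rather than overwriting b, and the function name and the 2-by-2 test system are illustrative choices.

```python
def forward_substitution_row(L, b):
    """Solve L x = b for nonsingular lower triangular L (row-oriented).

    At stage i the dot product of L(i, 1:i-1) with the already computed
    part of x is subtracted from b(i) before dividing by L(i, i).
    """
    n = len(b)
    x = list(b)                            # work on a copy of b
    for i in range(n):
        dot = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (x[i] - dot) / L[i][i]
    return x

L = [[2.0, 0.0],
     [3.0, 4.0]]
b = [4.0, 14.0]
x = forward_substitution_row(L, b)
print(x)    # x1 = 4/2 = 2, then x2 = (14 - 3*2)/4 = 2
```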


3.1.2 Back Substitution 


The analogous algorithm for upper triangular systems Ux = b is called
back-substitution. The recipe for $x_i$ is prescribed by

$$ x_i = \left( b_i - \sum_{j=i+1}^{n} u_{ij}x_j \right) \Big/ u_{ii} $$

and once again $b_i$ can be overwritten by $x_i$.


Algorithm 3.1.2 (Back Substitution: Row Version) If $U \in \mathbb{R}^{n\times n}$
is upper triangular and $b \in \mathbb{R}^n$, then the following algorithm overwrites b
with the solution to Ux = b. U is assumed to be nonsingular.

    b(n) = b(n)/U(n,n)
    for i = n-1:-1:1
        b(i) = (b(i) - U(i,i+1:n)b(i+1:n))/U(i,i)
    end

This algorithm requires $n^2$ flops and accesses U by row. The computed
solution $\hat{x}$ obtained by the algorithm can be shown to satisfy

$$ (U + F)\hat{x} = b, \qquad |F| \le n\mathbf{u}|U| + O(\mathbf{u}^2). \tag{3.1.2} $$
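
A corresponding sketch of Algorithm 3.1.2 in Python follows, again with 0-based indexing and a returned copy in place of overwriting; the function name and the small test system are illustrative.

```python
def back_substitution_row(U, b):
    """Solve U x = b for nonsingular upper triangular U (row-oriented).

    Rows are processed from the last to the first; at stage i the dot
    product of U(i, i+1:n) with the computed tail of x is subtracted.
    """
    n = len(b)
    x = list(b)                            # work on a copy of b
    for i in range(n - 1, -1, -1):
        dot = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - dot) / U[i][i]
    return x

U = [[2.0, 1.0],
     [0.0, 4.0]]
b = [5.0, 8.0]
x = back_substitution_row(U, b)
print(x)    # x2 = 8/4 = 2, then x1 = (5 - 1*2)/2 = 1.5
```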


3.1.3 Column Oriented Versions


Column oriented versions of the above procedures can be obtained by
reversing loop orders. To understand what this means from the algebraic
point of view, consider forward substitution. Once $x_1$ is resolved, it can
be removed from equations 2 through n and we proceed with the reduced
system L(2:n,2:n)x(2:n) = b(2:n) - x(1)L(2:n,1). We then compute $x_2$ and
remove it from equations 3 through n, etc. Thus, if this approach is applied
to

$$ \begin{bmatrix} 2 & 0 & 0 \\ 1 & 5 & 0 \\ 7 & 9 & 8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \\ 5 \end{bmatrix}, $$

we find $x_1 = 3$ and then deal with the 2-by-2 system

$$ \begin{bmatrix} 5 & 0 \\ 9 & 8 \end{bmatrix} \begin{bmatrix} x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 3 \begin{bmatrix} 1 \\ 7 \end{bmatrix} = \begin{bmatrix} -1 \\ -16 \end{bmatrix}. $$

Here is the complete procedure with overwriting.


Algorithm 3.1.3 (Forward Substitution: Column Version) If $L \in \mathbb{R}^{n\times n}$
is lower triangular and $b \in \mathbb{R}^n$, then this algorithm overwrites b with the
solution to Lx = b. L is assumed to be nonsingular.

    for j = 1:n-1
        b(j) = b(j)/L(j,j)
        b(j+1:n) = b(j+1:n) - b(j)L(j+1:n,j)
    end
    b(n) = b(n)/L(n,n)
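
The column-oriented procedure can be sketched in Python as follows, applied to the 3-by-3 system used above. The optional `trace` argument is an illustrative addition, not part of Algorithm 3.1.3; it records the right-hand side of each reduced system so that the intermediate vector (-1, -16) can be seen.

```python
def forward_substitution_column(L, b, trace=None):
    """Solve L x = b for nonsingular lower triangular L (column-oriented).

    After x(j) is found, a saxpy removes it from the remaining equations:
    b(j+1:n) = b(j+1:n) - b(j) * L(j+1:n, j).
    """
    n = len(b)
    x = list(b)                            # work on a copy of b
    for j in range(n - 1):
        x[j] = x[j] / L[j][j]
        for i in range(j + 1, n):
            x[i] -= x[j] * L[i][j]
        if trace is not None:
            trace.append(list(x[j + 1:]))  # right-hand side of the reduced system
    x[n - 1] = x[n - 1] / L[n - 1][n - 1]
    return x

L = [[2.0, 0.0, 0.0],
     [1.0, 5.0, 0.0],
     [7.0, 9.0, 8.0]]
b = [6.0, 2.0, 5.0]
steps = []
x = forward_substitution_column(L, b, trace=steps)
print(steps[0])   # reduced right-hand side after removing x1
print(x)
```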


It is also possible to obtain a column-oriented saxpy procedure for
back-substitution.


Algorithm 3.1.4 (Back Substitution: Column Version) If $U \in \mathbb{R}^{n\times n}$
is upper triangular and $b \in \mathbb{R}^n$, then this algorithm overwrites b with the
solution to Ux = b. U is assumed to be nonsingular.

    for j = n:-1:2
        b(j) = b(j)/U(j,j)
        b(1:j-1) = b(1:j-1) - b(j)U(1:j-1,j)
    end
    b(1) = b(1)/U(1,1)


Note that the dominant operation in both Algorithms 3.1.3 and 3.1.4 is
the saxpy operation. The roundoff behavior of these saxpy implementations
is essentially the same as for the dot product versions.

The accuracy of a computed solution to a triangular system is often
surprisingly good. See Higham (1996).

3.1.4 Multiple Right Hand Sides


Consider the problem of computing a solution $X \in \mathbb{R}^{n \times q}$ to LX = B where
$L \in \mathbb{R}^{n \times n}$ is lower triangular and $B \in \mathbb{R}^{n \times q}$. This is the multiple right
hand side forward substitution problem. We show that such a problem
can be solved by a block algorithm that is rich in matrix multiplication
assuming that q and n are large enough. This turns out to be important in
subsequent sections where various block factorization schemes are discussed.
We mention that although we are considering here just the lower triangular
problem, everything we say applies to the upper triangular case as well.

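To develop a block forward substitution algorithm we partition the
equation LX = B as follows:

$$\begin{bmatrix}
L_{11} & 0 & \cdots & 0 \\
L_{21} & L_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
L_{N1} & L_{N2} & \cdots & L_{NN}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix} =
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_N \end{bmatrix}. \tag{3.1.3}$$

Assume that the diagonal blocks are square. Paralleling the development of
Algorithm 3.1.3, we solve the system $L_{11}X_1 = B_1$ for $X_1$ and then remove
$X_1$ from block equations 2 through N:

$$\begin{bmatrix}
L_{22} & 0 & \cdots & 0 \\
L_{32} & L_{33} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
L_{N2} & L_{N3} & \cdots & L_{NN}
\end{bmatrix}
\begin{bmatrix} X_2 \\ X_3 \\ \vdots \\ X_N \end{bmatrix} =
\begin{bmatrix} B_2 - L_{21}X_1 \\ B_3 - L_{31}X_1 \\ \vdots \\ B_N - L_{N1}X_1 \end{bmatrix}.$$

Continuing in this way we obtain the following block saxpy forward
elimination scheme:

    for j = 1:N
        Solve L_jj X_j = B_j
        for i = j+1:N
            B_i = B_i - L_ij X_j                          (3.1.4)
        end
    end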


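Notice that the i-loop oversees a single block saxpy update of the form

$$\begin{bmatrix} B_{j+1} \\ \vdots \\ B_N \end{bmatrix} =
\begin{bmatrix} B_{j+1} \\ \vdots \\ B_N \end{bmatrix} -
\begin{bmatrix} L_{j+1,j} \\ \vdots \\ L_{N,j} \end{bmatrix} X_j.$$

For this to be handled as a matrix multiplication in a given
architecture it is clear that the blocking in (3.1.3) must give sufficiently "big"
$X_j$. Let us assume that this is the case if each $X_j$ has at least r rows.
This can be accomplished if N = ceil(n/r) and $X_1, \ldots, X_{N-1} \in \mathbb{R}^{r \times q}$ and
$X_N \in \mathbb{R}^{(n-(N-1)r) \times q}$.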
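The block saxpy scheme (3.1.4) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the name `block_forward_substitution` is hypothetical, matrices are lists of rows, indices are 0-based, and the block products are written as explicit loops where a production code would call an optimized matrix-multiply kernel.

```python
def block_forward_substitution(L, B, r):
    """Solve L X = B (L lower triangular n-by-n, B n-by-q) with block size r."""
    n, q = len(B), len(B[0])
    X = [[float(v) for v in row] for row in B]   # copy of B, overwritten by X
    for start in range(0, n, r):
        stop = min(start + r, n)                 # current block: rows start..stop-1
        # Solve L_jj X_j = B_j by forward substitution inside the block.
        for i in range(start, stop):
            for c in range(q):
                s = X[i][c]
                for k in range(start, i):
                    s -= L[i][k] * X[k][c]
                X[i][c] = s / L[i][i]
        # Block saxpy: B_i = B_i - L_ij X_j for every block row below.
        for i in range(stop, n):
            for c in range(q):
                s = 0.0
                for k in range(start, stop):
                    s += L[i][k] * X[k][c]
                X[i][c] -= s
    return X


L = [[2.0, 0.0, 0.0],
     [1.0, 5.0, 0.0],
     [7.0, 9.0, 8.0]]
B = [[2.0, 4.0], [16.0, 22.0], [74.0, 98.0]]    # B = L * [[1,2],[3,4],[5,6]]
X = block_forward_substitution(L, B, 2)
```

With r = 2 the last block has a single row, which is the unequal blocking described above.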

3.1.5 The Level-3 Fraction


It is handy to adopt a measure that quantifies the amount of matrix
multiplication in a given algorithm. To this end we define the level-3 fraction of
an algorithm to be the fraction of flops that occur in the context of matrix
multiplication. We call such flops level-3 flops.

Let us determine the level-3 fraction for (3.1.4) with the simplifying
assumption that n = rN. (The same conclusions hold with the unequal
blocking described above.) Because there are N applications of r-by-r
forward elimination (the level-2 portion of the computation) and $n^2$ flops
overall, the level-3 fraction is approximately given by

$$1 - \frac{Nr^2}{n^2} = 1 - \frac{1}{N}.$$

Thus, for large N almost all flops are level-3 flops and it makes sense to
choose N as large as possible subject to the constraint that the underlying
architecture can achieve a high level of performance when processing block
saxpy's of width at least r = n/N.


3.1.6 Non-square Triangular System Solving


The problem of solving nonsquare, m-by-n triangular systems deserves some
mention. Consider first the lower triangular case when m > n, i.e.,

$$\begin{bmatrix} L_{11} \\ L_{21} \end{bmatrix} x =
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \qquad
L_{11} \in \mathbb{R}^{n \times n},\; b_1 \in \mathbb{R}^{n},\;
L_{21} \in \mathbb{R}^{(m-n) \times n},\; b_2 \in \mathbb{R}^{m-n}.$$

Assume that $L_{11}$ is lower triangular and nonsingular. If we apply forward
elimination to $L_{11}x = b_1$ then x solves the system provided $L_{21}(L_{11}^{-1}b_1) = b_2$.
Otherwise, there is no solution to the overall system. In such a case
least squares minimization may be appropriate. See Chapter 5.
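The m > n case can be sketched in Python as a forward substitution followed by a consistency check on the remaining rows. The function name `solve_tall_lower_triangular` and the tolerance `tol` are assumptions of this sketch; in floating point the condition on the lower rows can only be tested approximately.

```python
def solve_tall_lower_triangular(L, b, tol=1e-10):
    """L is m-by-n with m >= n and a nonsingular lower triangular top block.

    Returns x if the lower rows are (numerically) consistent, otherwise None.
    """
    m, n = len(L), len(L[0])
    x = [0.0] * n
    for i in range(n):                       # forward substitution on L11 x = b1
        s = b[i] - sum(L[i][k] * x[k] for k in range(i))
        x[i] = s / L[i][i]
    for i in range(n, m):                    # check the rows of L21 x = b2
        residual = b[i] - sum(L[i][k] * x[k] for k in range(n))
        scale = abs(b[i]) + sum(abs(L[i][k] * x[k]) for k in range(n))
        if abs(residual) > tol * max(scale, 1.0):
            return None                      # no exact solution exists
    return x


L = [[2.0, 0.0],
     [1.0, 5.0],
     [7.0, 9.0]]
x_ok = solve_tall_lower_triangular(L, [6.0, 2.0, 19.2])   # consistent system
x_bad = solve_tall_lower_triangular(L, [6.0, 2.0, 0.0])   # inconsistent system
```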

Now consider the lower triangular system Lx = b when the number
of columns n exceeds the number of rows m. In this case apply forward
substitution to the square system L(1:m, 1:m)x(1:m) = b and prescribe
an arbitrary value for x(m+1:n). See §5.7 for additional comments on
systems that have more unknowns than equations.

The handling of nonsquare upper triangular systems is similar. Details
are left to the reader.


3.1.7 Unit Triangular Systems 


A unit triangular matrix is a triangular matrix with ones on the diagonal. 
Many of the triangular matrix computations that follow have this added 
bit of structure. It clearly poses no difficulty in the above procedures. 

3.1.8 The Algebra of Triangular Matrices


For future reference we list a few properties about products and inverses of
triangular and unit triangular matrices.

• The inverse of an upper (lower) triangular matrix is upper (lower)
triangular.

• The product of two upper (lower) triangular matrices is upper (lower)
triangular.

• The inverse of a unit upper (lower) triangular matrix is unit upper
(lower) triangular.

• The product of two unit upper (lower) triangular matrices is unit
upper (lower) triangular.


Problems


P3.1.1 Give an algorithm for computing a nonzero $z \in \mathbb{R}^n$ such that Uz = 0 where
$U \in \mathbb{R}^{n \times n}$ is upper triangular with $u_{nn} = 0$ and $u_{11} \cdots u_{n-1,n-1} \ne 0$.

P3.1.2 Discuss how the determinant of a square triangular matrix could be computed 
with minimum risk of overflow and underflow. 


P3.1.3 Rewrite Algorithm 3.1.4 given that U is stored by column in a length n(n+1)/2
array u.vec.


P3.1.4 Write a detailed version of (3.1.4). Do not assume that N divides n.

P3.1.5 Prove all the facts about triangular matrices that are listed in §3.1.8.


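P3.1.6 Suppose $S, T \in \mathbb{R}^{n \times n}$ are upper triangular and that $(ST - \lambda I)x = b$ is a
nonsingular system. Give an $O(n^2)$ algorithm for computing x. Note that the explicit
formation of $ST - \lambda I$ requires $O(n^3)$ flops. Hint. Suppose

$$S_+ = \begin{bmatrix} \sigma & u^T \\ 0 & S_c \end{bmatrix}, \qquad
T_+ = \begin{bmatrix} \tau & v^T \\ 0 & T_c \end{bmatrix}, \qquad
b_+ = \begin{bmatrix} \beta \\ b_c \end{bmatrix}$$

where $S_+ = S(k-1{:}n, k-1{:}n)$, $T_+ = T(k-1{:}n, k-1{:}n)$, $b_+ = b(k-1{:}n)$ and
$\sigma, \tau, \beta \in \mathbb{R}$. Show that if we have a vector $x_c$ such that

$$(S_cT_c - \lambda I)x_c = b_c$$

and $w_c = T_cx_c$ is available, then

$$x_+ = \begin{bmatrix} \gamma \\ x_c \end{bmatrix}, \qquad
\gamma = \frac{\beta - \sigma v^Tx_c - u^Tw_c}{\sigma\tau - \lambda}$$

solves $(S_+T_+ - \lambda I)x_+ = b_+$. Observe that $x_+$ and $w_+ = T_+x_+$ each require O(n - k)
flops.

P3.1.7 Suppose the matrices $R_1, \ldots, R_p \in \mathbb{R}^{n \times n}$ are all upper triangular. Give an
$O(pn^2)$ algorithm for solving the system $(R_1 \cdots R_p - \lambda I)x = b$ assuming that the matrix
of coefficients is nonsingular. Hint. Generalize the solution to the previous problem.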

Notes and References for Sec. 3.1

The accuracy of triangular system solvers is analyzed in


N.J. Higham (1989). "The Accuracy of Solutions to Triangular Systems," SIAM J. Num.
Anal. 26, 1252-1265.


3.2 The LU Factorization


As we have just seen, triangular systems are "easy" to solve. The idea
behind Gaussian elimination is to convert a given system Ax = b to an
equivalent triangular system. The conversion is achieved by taking
appropriate linear combinations of the equations. For example, in the system

$$\begin{aligned} 3x_1 + 5x_2 &= 9 \\ 6x_1 + 7x_2 &= 4 \end{aligned}$$

if we multiply the first equation by 2 and subtract it from the second we
obtain

$$\begin{aligned} 3x_1 + 5x_2 &= 9 \\ -3x_2 &= -14. \end{aligned}$$

This is n = 2 Gaussian elimination. Our objective in this section is to give
a complete specification of this central procedure and to describe what it
does in the language of matrix factorizations. This means showing that
the algorithm computes a unit lower triangular matrix L and an upper
triangular matrix U so that A = LU, e.g.,

$$\begin{bmatrix} 3 & 5 \\ 6 & 7 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix}
\begin{bmatrix} 3 & 5 \\ 0 & -3 \end{bmatrix}.$$

The solution to the original Ax = b problem is then found by a two step
triangular solve process:

$$Ly = b, \quad Ux = y \quad \Longrightarrow \quad Ax = LUx = Ly = b.$$


The LU factorization is a "high-level" algebraic description of Gaussian 
elimination. Expressing the outcome of a matrix algorithm in the "lan- 
guage" of matrix factorizations is a worthwhile activity. It facilitates gen- 
eralization and highlights connections between algorithms that may appear 
very different at the scalar level. 


3.2.1 Gauss Transformations 


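To obtain a factorization description of Gaussian elimination we need a
matrix description of the zeroing process. At the n = 2 level if $x_1 \ne 0$ and
$\tau = x_2/x_1$, then

$$\begin{bmatrix} 1 & 0 \\ -\tau & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} =
\begin{bmatrix} x_1 \\ 0 \end{bmatrix}.$$

More generally, suppose $x \in \mathbb{R}^n$ with $x_k \ne 0$. If

$$\tau^T = (\underbrace{0, \ldots, 0}_{k}, \tau_{k+1}, \ldots, \tau_n), \qquad
\tau_i = \frac{x_i}{x_k}, \quad i = k+1{:}n$$

and we define

$$M_k = I - \tau e_k^T, \tag{3.2.1}$$

then

$$M_kx = \begin{bmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -\tau_{k+1} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -\tau_n & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_k \\ x_{k+1} \\ \vdots \\ x_n \end{bmatrix} =
\begin{bmatrix} x_1 \\ \vdots \\ x_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

In general, a matrix of the form $M_k = I - \tau e_k^T \in \mathbb{R}^{n \times n}$ is a Gauss
transformation if the first k components of $\tau \in \mathbb{R}^n$ are zero. Such a matrix is
unit lower triangular. The components of $\tau(k+1{:}n)$ are called multipliers.
The vector $\tau$ is called the Gauss vector.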


3.2.2 Applying Gauss Transformations 


Multiplication by a Gauss transformation is particularly simple. If $C \in \mathbb{R}^{n \times r}$
and $M_k = I - \tau e_k^T$ is a Gauss transform, then

$$M_kC = (I - \tau e_k^T)C = C - \tau(e_k^TC) = C - \tau C(k,:)$$

is an outer product update. Since $\tau(1{:}k) = 0$ only C(k+1:n, :) is affected
and the update $C = M_kC$ can be computed row-by-row as follows:


    for i = k+1:n
        C(i,:) = C(i,:) - τ_i C(k,:)
    end


This computation touches n - k rows of length r and so requires 2(n - k)r
flops, which is at most 2(n - 1)r.
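A small Python sketch of forming a Gauss vector and applying the corresponding transformation row-by-row is given below. The names `gauss_vector` and `apply_gauss_transform` are hypothetical, and k is 0-based here, so the first k+1 entries of the returned vector are zero.

```python
def gauss_vector(x, k):
    """Return tau with tau[i] = x[i]/x[k] for i > k and zeros elsewhere."""
    if x[k] == 0:
        raise ValueError("x[k] must be nonzero")
    return [0.0] * (k + 1) + [x[i] / x[k] for i in range(k + 1, len(x))]


def apply_gauss_transform(C, tau, k):
    """Return (I - tau e_k^T) C; only rows k+1, ..., n-1 of C change."""
    out = [[float(v) for v in row] for row in C]
    for i in range(k + 1, len(C)):
        for j in range(len(C[0])):
            out[i][j] -= tau[i] * C[k][j]
    return out


A = [[1.0, 4.0, 7.0],
     [2.0, 5.0, 8.0],
     [3.0, 6.0, 10.0]]
tau = gauss_vector([row[0] for row in A], 0)   # multipliers from column 0
M1A = apply_gauss_transform(A, tau, 0)         # zeros column 0 below the diagonal
```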


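Example 3.2.1

$$C = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{bmatrix}, \quad
\tau = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} \quad \Longrightarrow \quad
(I - \tau e_1^T)C = \begin{bmatrix} 1 & 4 & 7 \\ 1 & 1 & 1 \\ 4 & 10 & 17 \end{bmatrix}.$$


3.2.3 Roundoff Properties of Gauss Transforms


If $\hat{\tau}$ is the computed version of an exact Gauss vector $\tau$, then it is easy to
verify that

$$\hat{\tau} = \tau + e, \qquad |e| \le u|\tau|.$$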

If $\hat{\tau}$ is used in a Gauss transform update and $fl((I - \hat{\tau}e_k^T)C)$ denotes the
computed result, then

$$fl((I - \hat{\tau}e_k^T)C) = (I - \tau e_k^T)C + E,$$

where

$$|E| \le 3u\left(|C| + |\tau||C(k,:)|\right) + O(u^2).$$

Clearly, if $\tau$ has large components, then the errors in the update may be
large in comparison to |C|. For this reason, care must be exercised when
Gauss transformations are employed, a matter that is pursued in §3.4.


3.2.4 Upper Triangularizing 


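Assume that $A \in \mathbb{R}^{n \times n}$. Gauss transformations $M_1, \ldots, M_{n-1}$ can usually
be found such that $M_{n-1} \cdots M_2M_1A = U$ is upper triangular. To see this
we first look at the n = 3 case. Suppose

$$A = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{bmatrix}.$$

If

$$M_1 = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -3 & 0 & 1 \end{bmatrix},$$

then

$$M_1A = \begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix}.$$

Likewise,

$$M_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix}
\quad \Longrightarrow \quad
M_2(M_1A) = \begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix}.$$

Extrapolating from this example observe that during the kth step

• We are confronted with a matrix $A^{(k-1)} = M_{k-1} \cdots M_1A$ that is
upper triangular in columns 1 to k - 1.

• The multipliers in $M_k$ are based on $A^{(k-1)}(k+1{:}n, k)$. In particular,
we need $a_{kk}^{(k-1)} \ne 0$ to proceed.

Noting that complete upper triangularization is achieved after n - 1 steps
we therefore obtain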

    k = 1
    while (A(k,k) ≠ 0) & (k ≤ n-1)
        τ(k+1:n) = A(k+1:n,k)/A(k,k)
        A(k+1:n,:) = A(k+1:n,:) - τ(k+1:n)A(k,:)          (3.2.2)
        k = k+1
    end


The entry A(k,k) must be checked to avoid a zero divide. These quantities
are referred to as the pivots and their relative magnitude turns out to be
critically important.


3.2.5 The LU Factorization 


In matrix language, if (3.2.2) terminates with k = n, then it computes
Gauss transforms $M_1, \ldots, M_{n-1}$ such that $M_{n-1} \cdots M_1A = U$ is upper
triangular. It is easy to check that if $M_k = I - \tau^{(k)}e_k^T$, then its inverse is
prescribed by $M_k^{-1} = I + \tau^{(k)}e_k^T$ and so

$$A = LU \tag{3.2.3}$$

where

$$L = M_1^{-1} \cdots M_{n-1}^{-1}. \tag{3.2.4}$$

It is clear that L is a unit lower triangular matrix because each $M_k^{-1}$ is unit
lower triangular. The factorization (3.2.3) is called the LU factorization of
A.

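As suggested by the need to check for zero pivots in (3.2.2), the LU
factorization need not exist. For example, it is impossible to find $\ell_{ij}$ and
$u_{ij}$ so

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 7 \\ 3 & 5 & 3 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ \ell_{21} & 1 & 0 \\ \ell_{31} & \ell_{32} & 1 \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}.$$

To see this equate entries and observe that we must have $u_{11} = 1$, $u_{12} = 2$,
$\ell_{21} = 2$, $u_{22} = 0$, and $\ell_{31} = 3$. But when we then look at the (3,2) entry
we obtain the contradictory equation $5 = \ell_{31}u_{12} + \ell_{32}u_{22} = 6$.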

As we now show, a zero pivot in (3.2.2) can be identified with a singular
leading principal submatrix.

Theorem 3.2.1 $A \in \mathbb{R}^{n \times n}$ has an LU factorization if $\det(A(1{:}k, 1{:}k)) \ne 0$
for k = 1:n-1. If the LU factorization exists and A is nonsingular, then
the LU factorization is unique and $\det(A) = u_{11} \cdots u_{nn}$.


Proof. Suppose k - 1 steps in (3.2.2) have been executed. At the beginning
of step k the matrix A has been overwritten by $M_{k-1} \cdots M_1A = A^{(k-1)}$.
Note that $a_{kk}^{(k-1)}$ is the kth pivot. Since the Gauss transformations are
unit lower triangular it follows by looking at the leading k-by-k portion of
this equation that $\det(A(1{:}k, 1{:}k)) = a_{11}^{(k-1)} \cdots a_{kk}^{(k-1)}$. Thus, if A(1:k, 1:k)
is nonsingular then the kth pivot is nonzero.

As for uniqueness, if $A = L_1U_1$ and $A = L_2U_2$ are two LU factorizations
of a nonsingular A, then $L_2^{-1}L_1 = U_2U_1^{-1}$. Since $L_2^{-1}L_1$ is unit lower
triangular and $U_2U_1^{-1}$ is upper triangular, it follows that both of these
matrices must equal the identity. Hence, $L_1 = L_2$ and $U_1 = U_2$.

Finally, if A = LU then $\det(A) = \det(LU) = \det(L)\det(U) =
\det(U) = u_{11} \cdots u_{nn}$. □


3.2.6 Some Practical Details 


From the practical point of view there are several improvements that can
be made to (3.2.2). First, because zeros have already been introduced in
columns 1 through k - 1, the Gauss transform update need only be applied
to columns k through n. Of course, we need not even apply the kth Gauss
transform to A(:, k) since we know the result. So the efficient thing to do
is simply to update A(k+1:n, k+1:n). Another worthwhile observation is
that the multipliers associated with $M_k$ can be stored in the locations that
they zero, i.e., A(k+1:n, k). With these changes we obtain the following
version of (3.2.2):


Algorithm 3.2.1 (Outer Product Gaussian Elimination) Suppose
$A \in \mathbb{R}^{n \times n}$ has the property that A(1:k, 1:k) is nonsingular for k = 1:n-1.
This algorithm computes the factorization $M_{n-1} \cdots M_1A = U$ where U is
upper triangular and each $M_k$ is a Gauss transform. U is stored in the
upper triangle of A. The multipliers associated with $M_k$ are stored in
A(k+1:n, k), i.e., $A(k+1{:}n, k) = -M_k(k+1{:}n, k)$.


    for k = 1:n-1
        rows = k+1:n
        A(rows,k) = A(rows,k)/A(k,k)
        A(rows,rows) = A(rows,rows) - A(rows,k)A(k,rows)
    end


This algorithm involves $2n^3/3$ flops and it is one of several formulations of
Gaussian Elimination. Note that each pass through the k-loop involves an
outer product.
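For illustration, the outer product form of Gaussian elimination can be sketched in pure Python. The function name `lu_outer_product`, the explicit zero-pivot check that raises `ValueError`, and the decision to return a packed copy instead of overwriting the input are assumptions of this sketch.

```python
def lu_outer_product(A):
    """Return a packed copy of A: U on and above the diagonal, multipliers below."""
    n = len(A)
    A = [[float(v) for v in row] for row in A]
    for k in range(n - 1):
        if A[k][k] == 0.0:
            raise ValueError("zero pivot encountered at step %d" % (k + 1))
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # multiplier stored where it zeros
        for i in range(k + 1, n):              # outer product update of the
            for j in range(k + 1, n):          # trailing (n-k-1)-by-(n-k-1) block
                A[i][j] -= A[i][k] * A[k][j]
    return A


packed = lu_outer_product([[1, 4, 7],
                           [2, 5, 8],
                           [3, 6, 10]])
```

No pivoting is performed, so this sketch inherits the existence and stability limitations of the algorithm as stated.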


3.2.7 Where is L?


Algorithm 3.2.1 represents L in terms of the multipliers. In particular, if
$\tau^{(k)}$ is the vector of multipliers associated with $M_k$ then upon termination,
$A(k+1{:}n, k) = \tau^{(k)}(k+1{:}n)$. One of the more happy "coincidences" in matrix
computations is that if $L = M_1^{-1} \cdots M_{n-1}^{-1}$, then $L(k+1{:}n, k) = \tau^{(k)}(k+1{:}n)$. This
follows from a careful look at the product that defines L. Indeed,

$$L = \left(I + \tau^{(1)}e_1^T\right) \cdots \left(I + \tau^{(n-1)}e_{n-1}^T\right)
= I + \sum_{k=1}^{n-1} \tau^{(k)}e_k^T.$$

Since A(k+1:n, k) houses the kth vector of multipliers $\tau^{(k)}$, it follows that
A(i, k) houses $\ell_{ik}$ for all i > k.


3.2.8 Solving a Linear System 


Once A has been factored via Algorithm 3.2.1, then L and U are represented
in the array A. We can then solve the system Ax = b via the triangular
systems Ly = b and Ux = y by using the methods of §3.1.


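Example 3.2.2 If Algorithm 3.2.1 is applied to

$$A = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix},$$

then upon completion,

$$A = \begin{bmatrix} 1 & 4 & 7 \\ 2 & -3 & -6 \\ 3 & 2 & 1 \end{bmatrix}.$$

If $b = (1, 1, 1)^T$, then $y = (1, -1, 0)^T$ solves Ly = b and $x = (-1/3, 1/3, 0)^T$ solves
Ux = y.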
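The two-step solve can be carried out directly on the packed array without unpacking L and U. The sketch below assumes the packed layout described above (unit diagonal of L implicit, multipliers below the diagonal); the function name `solve_with_packed_lu` is hypothetical.

```python
def solve_with_packed_lu(LU, b):
    """Solve Ly = b then Ux = y, reading L and U from the packed array LU."""
    n = len(b)
    y = [float(v) for v in b]
    for i in range(n):                    # forward substitution, unit diagonal
        for k in range(i):
            y[i] -= LU[i][k] * y[k]
    x = y[:]
    for i in range(n - 1, -1, -1):        # back substitution with U
        for k in range(i + 1, n):
            x[i] -= LU[i][k] * x[k]
        x[i] /= LU[i][i]
    return y, x


# Packed factors of the 3-by-3 matrix with rows (1,4,7), (2,5,8), (3,6,10).
LU = [[1.0, 4.0, 7.0],
      [2.0, -3.0, -6.0],
      [3.0, 2.0, 1.0]]
y, x = solve_with_packed_lu(LU, [1.0, 1.0, 1.0])
```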


3.2.9 Other Versions 


Gaussian elimination, like matrix multiplication, is a triple-loop procedure
that can be arranged in several ways. Algorithm 3.2.1 corresponds to the
"kij" version of Gaussian elimination if we compute the outer product
update row-by-row:


    for k = 1:n-1
        A(k+1:n,k) = A(k+1:n,k)/A(k,k)
        for i = k+1:n
            for j = k+1:n
                A(i,j) = A(i,j) - A(i,k)A(k,j)
            end
        end
    end


There are five other versions: kji, ikj, ijk, jik, and jki. The last of these
results in an implementation that features a sequence of gaxpy's and
forward eliminations. In this formulation, the Gauss transformations are not
immediately applied to A as they are in the outer product version. Instead,
their application is delayed. The original A(:, j) is untouched until step j.
At that point in the algorithm A(:, j) is overwritten by $M_{j-1} \cdots M_1A(:, j)$.
The jth Gauss transformation is then computed.

To be precise, suppose $1 \le j \le n - 1$ and assume that L(:, 1:j-1)
and U(1:j-1, 1:j-1) are known. This means that the first j - 1 columns
of L and U are available. To get the jth columns of L and U we equate
jth columns in the equation A = LU: A(:, j) = LU(:, j). From this we
conclude that

$$A(1{:}j-1, j) = L(1{:}j-1, 1{:}j-1)\,U(1{:}j-1, j)$$

and

$$A(j{:}n, j) = \sum_{k=1}^{j} L(j{:}n, k)\,U(k, j).$$

The first equation is a lower triangular system that can be solved for the
vector U(1:j-1, j). Once this is accomplished, the second equation can be
rearranged to produce recipes for U(j, j) and L(j+1:n, j). Indeed, if we
set

$$v(j{:}n) = A(j{:}n, j) - \sum_{k=1}^{j-1} L(j{:}n, k)\,U(k, j)
= A(j{:}n, j) - L(j{:}n, 1{:}j-1)\,U(1{:}j-1, j),$$

then L(j+1:n, j) = v(j+1:n)/v(j) and U(j, j) = v(j). Thus, L(j+1:n, j)
is a scaled gaxpy and we obtain


    L = I; U = 0
    for j = 1:n
        if j = 1
            v(j:n) = A(j:n,j)
        else
            Solve L(1:j-1,1:j-1)z = A(1:j-1,j) for z      (3.2.5)
            and set U(1:j-1,j) = z.
            v(j:n) = A(j:n,j) - L(j:n,1:j-1)z
        end
        if j < n
            L(j+1:n,j) = v(j+1:n)/v(j)
        end
        U(j,j) = v(j)
    end


This arrangement of Gaussian elimination is rich in forward eliminations
and gaxpy operations and, like Algorithm 3.2.1, requires $2n^3/3$ flops.
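The gaxpy arrangement (3.2.5) can be sketched as below. This is an illustrative pure-Python version under stated assumptions: the name `lu_gaxpy` is hypothetical, indices are 0-based, L and U are returned as separate matrices, and a zero pivot raises `ValueError`.

```python
def lu_gaxpy(A):
    """Column-by-column (jki) LU without pivoting; returns (L, U)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Solve L(0:j, 0:j) z = A(0:j, j); L has a unit diagonal.
        z = []
        for i in range(j):
            z.append(A[i][j] - sum(L[i][k] * z[k] for k in range(i)))
            U[i][j] = z[i]
        # Gaxpy: v = A(j:n, j) - L(j:n, 0:j) z
        v = [A[i][j] - sum(L[i][k] * z[k] for k in range(j))
             for i in range(j, n)]
        if j < n - 1:
            if v[0] == 0.0:
                raise ValueError("zero pivot encountered in column %d" % (j + 1))
            for i in range(j + 1, n):
                L[i][j] = v[i - j] / v[0]
        U[j][j] = v[0]
    return L, U


L, U = lu_gaxpy([[1.0, 4.0, 7.0],
                 [2.0, 5.0, 8.0],
                 [3.0, 6.0, 10.0]])
```

Column j of A is read only at step j, which is the delayed application of the Gauss transformations described above.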
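
3.2.10 Block LU


It is possible to organize Gaussian elimination so that matrix multiplication
becomes the dominant operation. The key to the derivation of this block
procedure is to partition $A \in \mathbb{R}^{n \times n}$ as follows

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ is r-by-r, $A_{22}$ is (n-r)-by-(n-r), and r is a blocking parameter.
Suppose we compute the LU factorization $L_{11}U_{11} = A_{11}$ and then solve the
multiple right hand side triangular systems $L_{11}U_{12} = A_{12}$ and $L_{21}U_{11} = A_{21}$
for $U_{12}$ and $L_{21}$ respectively. It follows that

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} =
\begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix}
\begin{bmatrix} I_r & 0 \\ 0 & \tilde{A} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} \\ 0 & I_{n-r} \end{bmatrix}$$

where $\tilde{A} = A_{22} - L_{21}U_{12}$. The matrix $\tilde{A}$ is the Schur complement of $A_{11}$
with respect to A. Note that if $\tilde{A} = L_{22}U_{22}$ is the LU factorization of $\tilde{A}$,
then

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} =
\begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}$$

is the LU factorization of A. Thus, after $L_{11}$, $L_{21}$, $U_{11}$ and $U_{12}$ are
computed, we repeat the process on the level-3 updated (2,2) block $\tilde{A}$.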


Algorithm 3.2.2 (Block Outer Product LU) Suppose $A \in \mathbb{R}^{n \times n}$
and that det(A(1:k, 1:k)) is nonzero for k = 1:n-1. Assume that r satisfies
$1 \le r \le n$. The following algorithm computes A = LU via rank r updates.
Upon completion, A(i, j) is overwritten with L(i, j) for i > j and A(i, j) is
overwritten with U(i, j) if $j \ge i$.


    λ = 1
    while λ ≤ n
        μ = min(n, λ+r-1)
        Use Algorithm 3.2.1 to overwrite A(λ:μ, λ:μ)
            with its LU factors L̃ and Ũ.
        Solve L̃Z = A(λ:μ, μ+1:n) for Z and overwrite
            A(λ:μ, μ+1:n) with Z.
        Solve WŨ = A(μ+1:n, λ:μ) for W and overwrite
            A(μ+1:n, λ:μ) with W.
        A(μ+1:n, μ+1:n) = A(μ+1:n, μ+1:n) - WZ
        λ = μ+1
    end


This algorithm involves $2n^3/3$ flops.

Recalling the discussion in §3.1.5, let us consider the level-3 fraction
for this procedure assuming that r is large enough so that the underlying
computer is able to compute the matrix multiply update
A(μ+1:n, μ+1:n) = A(μ+1:n, μ+1:n) - WZ at "level-3 speed." Assume for clarity
that n = rN. The only flops that are not level-3 flops occur in the context
of the r-by-r LU factorizations $A(\lambda{:}\mu, \lambda{:}\mu) = \tilde{L}\tilde{U}$. Since there are N such
systems solved in the overall computation, we see that the level-3 fraction
is given by

$$1 - \frac{N(2r^3/3)}{2n^3/3} = 1 - \frac{1}{N^2}.$$

Thus, for large N almost all arithmetic takes place in the context of matrix
multiplication. As we have mentioned, this ensures high performance on a
wide range of computing environments.
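A compact pure-Python sketch of the block outer product scheme follows. It is illustrative only: the name `block_lu` is hypothetical, indices are 0-based with half-open block ranges, and the rank-r update is written as explicit loops where a tuned implementation would call a level-3 matrix-multiply routine.

```python
def block_lu(A, r):
    """Packed LU of A (no pivoting) computed with r-by-r diagonal blocks."""
    n = len(A)
    A = [[float(v) for v in row] for row in A]
    lam = 0
    while lam < n:
        mu = min(n, lam + r)                  # diagonal block is lam..mu-1
        for k in range(lam, mu):              # unblocked LU of the diagonal block
            if A[k][k] == 0.0:
                raise ValueError("zero pivot encountered")
            for i in range(k + 1, mu):
                A[i][k] /= A[k][k]
                for j in range(k + 1, mu):
                    A[i][j] -= A[i][k] * A[k][j]
        for j in range(mu, n):                # solve L~ Z = A(lam:mu, mu:n)
            for i in range(lam, mu):
                for k in range(lam, i):
                    A[i][j] -= A[i][k] * A[k][j]
        for i in range(mu, n):                # solve W U~ = A(mu:n, lam:mu)
            for j in range(lam, mu):
                for k in range(lam, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]
        for i in range(mu, n):                # rank-r update of the (2,2) block
            for j in range(mu, n):
                for k in range(lam, mu):
                    A[i][j] -= A[i][k] * A[k][j]
        lam = mu
    return A


packed = block_lu([[1, 4, 7], [2, 5, 8], [3, 6, 10]], 2)
```

With r = 1 the sketch reduces to the unblocked outer product algorithm, which gives a simple consistency check.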


3.2.11 The LU Factorization of a Rectangular Matrix 


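The LU factorization of a rectangular matrix $A \in \mathbb{R}^{m \times n}$ can also be
performed. The m > n case is illustrated by

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 3 & 1 \\ 5 & 2 \end{bmatrix}
\begin{bmatrix} 1 & 2 \\ 0 & -2 \end{bmatrix}$$

while

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \end{bmatrix}$$

depicts the m < n situation. The LU factorization of $A \in \mathbb{R}^{m \times n}$ is
guaranteed to exist if A(1:k, 1:k) is nonsingular for k = 1:min(m,n).

The square LU factorization algorithms above need only minor
modification to handle the rectangular case. For example, to handle the m > n
case we modify Algorithm 3.2.1 as follows:


    for k = 1:n
        rows = k+1:m
        A(rows,k) = A(rows,k)/A(k,k)
        if k < n
            cols = k+1:n
            A(rows,cols) = A(rows,cols) - A(rows,k)A(k,cols)
        end
    end


This algorithm requires $mn^2 - n^3/3$ flops.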
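The m > n modification can be sketched in Python as follows. The function name `lu_tall` and the packed return format (an m-by-n array holding U in its upper triangle and the multipliers below the diagonal) are assumptions of this sketch; the test matrix with rows (1,2), (3,4), (5,6) is chosen for illustration.

```python
def lu_tall(A):
    """Packed LU (no pivoting) of an m-by-n matrix with m > n."""
    m, n = len(A), len(A[0])
    A = [[float(v) for v in row] for row in A]
    for k in range(n):
        if A[k][k] == 0.0:
            raise ValueError("zero pivot encountered at step %d" % (k + 1))
        for i in range(k + 1, m):
            A[i][k] /= A[k][k]                 # multipliers for all m-k-1 rows
        if k < n - 1:
            for i in range(k + 1, m):          # update the trailing columns only
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return A


packed = lu_tall([[1, 2],
                  [3, 4],
                  [5, 6]])
```

Here the packed result encodes L with columns (1, 3, 5) and (0, 1, 2), and U with rows (1, 2) and (0, -2).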

3.2.12 A Note on Failure


As we know, Gaussian elimination fails unless the first n - 1 principal
submatrices are nonsingular. This rules out some very simple matrices,
e.g.,

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

While A has perfect 2-norm condition, it fails to have an LU factorization
because it has a singular leading principal submatrix.

Clearly, modifications are necessary if Gaussian elimination is to be 
effectively used in general linear system solving. The error analysis in the 
following section suggests the needed modifications. 


Problems


P3.2.1 Suppose the entries of $A(\epsilon) \in \mathbb{R}^{n \times n}$ are continuously differentiable functions of
the scalar $\epsilon$. Assume that $A \equiv A(0)$ and all its principal submatrices are nonsingular.
Show that for sufficiently small $\epsilon$, the matrix $A(\epsilon)$ has an LU factorization $A(\epsilon) =
L(\epsilon)U(\epsilon)$ and that $L(\epsilon)$ and $U(\epsilon)$ are both continuously differentiable.


P3.2.2 Suppose we partition $A \in \mathbb{R}^{n \times n}$

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ is r-by-r. Assume that $A_{11}$ is nonsingular. The matrix $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$
is called the Schur complement of $A_{11}$ in A. Show that if $A_{11}$ has an LU factorization,
then after r steps of Algorithm 3.2.1, A(r+1:n, r+1:n) houses S. How could S be
obtained after r steps of (3.2.5)?

P3.2.3 Suppose $A \in \mathbb{R}^{n \times n}$ has an LU factorization. Show how Ax = b can be solved
without storing the multipliers by computing the LU factorization of the n-by-(n+1)
matrix [A b].

P3.2.4 Describe a variant of Gaussian elimination that introduces zeros into the columns
of A in the order n:-1:2 and which produces the factorization A = UL where U is unit
upper triangular and L is lower triangular.

P3.2.5 Matrices in $\mathbb{R}^{n \times n}$ of the form $N(y, k) = I - ye_k^T$ where $y \in \mathbb{R}^n$ are said to
be Gauss-Jordan transformations. (a) Give a formula for $N(y, k)^{-1}$ assuming it exists.
(b) Given $x \in \mathbb{R}^n$, under what conditions can y be found so $N(y, k)x = e_k$? (c) Give
an algorithm using Gauss-Jordan transformations that overwrites A with $A^{-1}$. What
conditions on A ensure the success of your algorithm?

P3.2.6 Extend (3.2.5) so that it can also handle the case when A has more rows than 
columns. 


P3.2.7 Show how A can be overwritten with L and U in (3.2.5). Organize the three
loops so that unit stride access prevails.

P3.2.8 Develop a version of Gaussian elimination in which the innermost of the three 
loops oversees a dot product. 

Notes and References for Sec. 3.2


Schur complements (P3.2.2) arise in many applications. For a survey of both practical
and theoretical interest, see


R.W. Cottle (1974). "Manifestations of the Schur Complement," Lin. Alg. and its
Applic. 8, 189-211.


Schur complements are known as "Gauss transforms" in some application areas. The
use of Gauss-Jordan transformations (P3.2.5) is detailed in Fox (1964). See also


T. Dekker and W. Hoffman (1989). "Rehabilitation of the Gauss-Jordan Algorithm,"
Numer. Math. 54, 591-599.


As we mentioned, inner product versions of Gaussian elimination have been known and
used for some time. The names of Crout and Doolittle are associated with these ijk
techniques. They were popular during the days of desk calculators because there are
far fewer intermediate results than in Gaussian elimination. These methods still have
attraction because they can be implemented with accumulated inner products. For
remarks along these lines see Fox (1964) as well as Stewart (1973, pp. 131-39). See also:


G.E. Forsythe (1960). "Crout with Pivoting," Comm. ACM 3, 507-8.

W.M. McKeeman (1962). "Crout with Equilibration and Iteration," Comm. ACM 5,
553-55.


Loop orderings and block issues in LU computations are discussed in


J.J. Dongarra, F.G. Gustavson, and A. Karp (1984). "Implementing Linear Algebra
Algorithms for Dense Matrices on a Vector Pipeline Machine," SIAM Review 26,
91-112.

J.M. Ortega (1988). "The ijk Forms of Factorization Methods I: Vector Computers,"
Parallel Computers 7, 135-147.

D.H. Bailey, K. Lee, and H.D. Simon (1991). "Using Strassen's Algorithm to Accelerate
the Solution of Linear Systems," J. Supercomputing 4, 357-371.

J.W. Demmel, N.J. Higham, and R.S. Schreiber (1995). "Stability of Block LU
Factorization," Numer. Lin. Alg. with Applic. 2, 173-190.


3.3 Roundoff Analysis of Gaussian Elimination


We now assess the effect of rounding errors when the algorithms in the
previous two sections are used to solve the linear system Ax = b. A much
more detailed treatment of roundoff error in Gaussian elimination is given
in Higham (1996).

Before we proceed with the analysis, it is useful to consider the nearly
ideal situation in which no roundoff occurs during the entire solution process
except when A and b are stored. Thus, if fl(b) = b + e and the stored matrix
fl(A) = A + E is nonsingular, then we are assuming that the computed
solution $\hat{x}$ satisfies

$$(A + E)\hat{x} = (b + e), \qquad \|E\|_\infty \le u\|A\|_\infty, \quad
\|e\|_\infty \le u\|b\|_\infty. \tag{3.3.1}$$

That is, $\hat{x}$ solves a "nearby" system exactly. Moreover, if $u\kappa_\infty(A) \le \frac{1}{2}$
(say), then by using Theorem 2.7.2, it can be shown that

$$\frac{\|x - \hat{x}\|_\infty}{\|x\|_\infty} \le 4u\kappa_\infty(A). \tag{3.3.2}$$

The bounds (3.3.1) and (3.3.2) are "best possible" norm bounds. No general
∞-norm error analysis of a linear equation solver that requires the storage of
A and b can render sharper bounds. As a consequence, we cannot justifiably
criticize an algorithm for returning an inaccurate $\hat{x}$ if A is ill-conditioned
relative to the machine precision, e.g., $u\kappa_\infty(A) \approx 1$.


3.3.1 Errors in the LU Factorization 


Let us see how the error bounds for Gaussian elimination compare with
the ideal bounds above. We work with the infinity norm for convenience
and focus our attention on Algorithm 3.2.1, the outer product version.
The error bounds that we derive also apply to the gaxpy formulation
(3.2.5).

Our first task is to quantify the roundoff errors associated with the
computed triangular factors.


Theorem 3.3.1 Assume that A is an n-by-n matrix of floating point
numbers. If no zero pivots are encountered during the execution of Algorithm
3.2.1, then the computed triangular matrices $\hat{L}$ and $\hat{U}$ satisfy

$$\hat{L}\hat{U} = A + H \tag{3.3.3}$$

$$|H| \le 3(n-1)u\left(|A| + |\hat{L}||\hat{U}|\right) + O(u^2). \tag{3.3.4}$$


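Proof. The proof is by induction on n. The theorem obviously holds for
n = 1. Assume it holds for all (n-1)-by-(n-1) floating point matrices. If

$$A = \begin{bmatrix} \alpha & w^T \\ v & B \end{bmatrix},
\qquad \alpha \in \mathbb{R},\; v, w \in \mathbb{R}^{n-1},\; B \in \mathbb{R}^{(n-1) \times (n-1)},$$

then $\hat{z} = fl(v/\alpha)$ and $\hat{A}_1 = fl(B - \hat{z}w^T)$ are computed in the first step of
the algorithm. We therefore have

$$\hat{z} = \frac{1}{\alpha}v + f, \qquad |f| \le u\frac{|v|}{|\alpha|} \tag{3.3.5}$$

and

$$\hat{A}_1 = B - \hat{z}w^T + F, \qquad
|F| \le 2u\left(|B| + |\hat{z}||w|^T\right) + O(u^2). \tag{3.3.6}$$

The algorithm now proceeds to calculate the LU factorization of $\hat{A}_1$. By
induction, we compute approximate factors $\hat{L}_1$ and $\hat{U}_1$ for $\hat{A}_1$ that satisfy

$$\hat{L}_1\hat{U}_1 = \hat{A}_1 + H_1 \tag{3.3.7}$$

$$|H_1| \le 3(n-2)u\left(|\hat{A}_1| + |\hat{L}_1||\hat{U}_1|\right) + O(u^2). \tag{3.3.8}$$

Thus,

$$\hat{L}\hat{U} \equiv
\begin{bmatrix} 1 & 0 \\ \hat{z} & \hat{L}_1 \end{bmatrix}
\begin{bmatrix} \alpha & w^T \\ 0 & \hat{U}_1 \end{bmatrix}
= A + \begin{bmatrix} 0 & 0 \\ \alpha f & H_1 + F \end{bmatrix}
\equiv A + H.$$

From (3.3.6) it follows that

$$|\hat{A}_1| \le (1 + 2u)\left(|B| + |\hat{z}||w|^T\right) + O(u^2),$$

and therefore by using (3.3.7) and (3.3.8) we have

$$|H_1 + F| \le 3(n-1)u\left(|B| + |\hat{z}||w|^T + |\hat{L}_1||\hat{U}_1|\right) + O(u^2).$$

Since $|\alpha f| \le u|v|$ it is easy to verify that

$$|H| \le 3(n-1)u\left\{
\begin{bmatrix} |\alpha| & |w|^T \\ |v| & |B| \end{bmatrix} +
\begin{bmatrix} 1 & 0 \\ |\hat{z}| & |\hat{L}_1| \end{bmatrix}
\begin{bmatrix} |\alpha| & |w|^T \\ 0 & |\hat{U}_1| \end{bmatrix}
\right\} + O(u^2),$$

thereby proving the theorem. □


We mention that if A is m-by-n, then the theorem applies with n in (3.3.4)
replaced by the smaller of n and m.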


3.3.2 Triangular Solving with Inexact Triangles


We next examine the effect of roundoff error when $\hat{L}$ and $\hat{U}$ are used by the
triangular system solvers of §3.1.


Theorem 3.3.2 Let $\hat{L}$ and $\hat{U}$ be the computed LU factors of the n-by-n
floating point matrix A obtained by either Algorithm 3.2.1 or the gaxpy
scheme (3.2.5). Suppose the methods of §3.1 are used to produce the computed
solution $\hat{y}$ to $\hat{L}y = b$ and the computed solution $\hat{x}$ to $\hat{U}x = \hat{y}$. Then
$(A + E)\hat{x} = b$ with

$$|E| \le nu\left(3|A| + 5|\hat{L}||\hat{U}|\right) + O(u^2). \tag{3.3.9}$$
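
Proof. From (3.1.1) and (3.1.2) we have

$$(\hat{L} + F)\hat{y} = b, \qquad |F| \le nu|\hat{L}| + O(u^2),$$

$$(\hat{U} + G)\hat{x} = \hat{y}, \qquad |G| \le nu|\hat{U}| + O(u^2),$$

and thus

$$(\hat{L} + F)(\hat{U} + G)\hat{x} = (\hat{L}\hat{U} + F\hat{U} + \hat{L}G + FG)\hat{x} = b.$$

From Theorem 3.3.1

$$\hat{L}\hat{U} = A + H,$$

with $|H| \le 3(n-1)u(|A| + |\hat{L}||\hat{U}|) + O(u^2)$, and so by defining

$$E = H + F\hat{U} + \hat{L}G + FG$$

we find $(A + E)\hat{x} = b$. Moreover,

$$\begin{aligned}
|E| &\le |H| + |F||\hat{U}| + |\hat{L}||G| + O(u^2) \\
&\le 3nu\left(|A| + |\hat{L}||\hat{U}|\right) + 2nu\left(|\hat{L}||\hat{U}|\right) + O(u^2). \quad \Box
\end{aligned}$$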


IA 


IA 


Were it not for the possibility of a large $|\hat{L}||\hat{U}|$ term, (3.3.9) would compare
favorably with the ideal bound in (3.3.1). (The factor n is of no
consequence, cf. the Wilkinson quotation in §2.4.6.) Such a possibility exists, for
there is nothing in Gaussian elimination to rule out the appearance of small
pivots. If a small pivot is encountered, then we can expect large numbers
to be present in $\hat{L}$ and $\hat{U}$.

We stress that small pivots are not necessarily due to ill-conditioning as
the example

$$A = \begin{bmatrix} \epsilon & 1 \\ 1 & 0 \end{bmatrix}$$

bears out. Thus, Gaussian elimination can give
arbitrarily poor results, even for well-conditioned problems. The method is
unstable.

In order to repair this shortcoming of the algorithm, it is necessary to
introduce row and/or column interchanges during the elimination process
with the intention of keeping the numbers that arise during the calculation
suitably bounded. This idea is pursued in the next section.


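Example 3.3.1 Suppose $\beta = 10$, t = 3 floating point arithmetic is used to solve:

$$\begin{bmatrix} .001 & 1.00 \\ 1.00 & 2.00 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} =
\begin{bmatrix} 1.00 \\ 3.00 \end{bmatrix}.$$

Applying Gaussian elimination we get

$$\hat{L} = \begin{bmatrix} 1 & 0 \\ 1000 & 1 \end{bmatrix}, \qquad
\hat{U} = \begin{bmatrix} .001 & 1 \\ 0 & -1000 \end{bmatrix}$$

and a calculation shows

$$\hat{L}\hat{U} = \begin{bmatrix} .001 & 1 \\ 1 & 2 \end{bmatrix} +
\begin{bmatrix} 0 & 0 \\ 0 & -2 \end{bmatrix} \equiv A + H.$$

Moreover, $6\begin{bmatrix} 10^{-6} & .001 \\ .001 & 1.001 \end{bmatrix}$ is the bounding matrix in (3.3.4), not a severe
overestimate of |H|. If we go on to solve the problem using the triangular system solvers of §3.1,
then using the same precision arithmetic we obtain a computed solution $\hat{x} = (0, 1)^T$.
This is in contrast to the exact solution $x = (1.002\ldots, .998\ldots)^T$.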


Problems 


P3.3.1 Show that if we drop the assumption that A is a floating point matrix in
Theorem 3.3.1, then (3.3.4) holds with the coefficient "3" replaced by "4."


P3.3.2 Suppose A is an n-by-n matrix and that $\hat{L}$ and $\hat{U}$ are produced by Algorithm
3.2.1. (a) How many flops are required to compute $\|\,|\hat{L}||\hat{U}|\,\|_\infty$? (b) Show
$fl(|\hat{L}||\hat{U}|) \le (1 + 2nu)|\hat{L}||\hat{U}| + O(u^2)$.

P3.3.3 Suppose $x = A^{-1}b$. Show that if $e = x - \hat{x}$ (the error) and $r = b - A\hat{x}$ (the
residual), then

$$\frac{\|r\|}{\|A\|} \le \|e\| \le \|A^{-1}\|\,\|r\|.$$

Assume consistency between the matrix and vector norm.


P3.3.4 Using 2-digit, base 10, floating point arithmetic, compute the LU factorization
of

$$A = \begin{bmatrix} 7 & 6 \\ 9 & 8 \end{bmatrix}.$$

For this example, what is the matrix H in (3.3.3)?


Notes and References for Sec. 3.3

The original roundoff analysis of Gaussian elimination appears in


J.H. Wilkinson (1961). "Error Analysis of Direct Methods of Matrix Inversion," J. ACM
8, 281-330.


Various improvements in the bounds and simplifications in the analysis have occurred
over the years. See


B.A. Chartres and J.C. Geuder (1967). "Computable Error Bounds for Direct Solution
of Linear Equations," J. ACM 14, 63-71.

J.K. Reid (1971). "A Note on the Stability of Gaussian Elimination," J. Inst. Math.
Applic. 8, 374-75.

C.C. Paige (1973). "An Error Analysis of a Method for Solving Matrix Equations,"
Math. Comp. 27, 355-59.

C. de Boor and A. Pinkus (1977). "A Backward Error Analysis for Totally Positive
Linear Systems," Numer. Math. 27, 485-90.

H.H. Robertson (1977). "The Accuracy of Error Estimates for Systems of Linear
Algebraic Equations," J. Inst. Math. Applic. 20, 409-14.

J.J. Du Croz and N.J. Higham (1992). "Stability of Methods for Matrix Inversion," IMA
J. Num. Anal. 12, 1-19.

3.4 Pivoting


The analysis in the previous section shows that we must take steps to ensure
that no large entries appear in the computed triangular factors $\hat{L}$ and $\hat{U}$.
The example

$$A = \begin{bmatrix} .0001 & 1 \\ 1 & 1 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 10{,}000 & 1 \end{bmatrix}
\begin{bmatrix} .0001 & 1 \\ 0 & -9999 \end{bmatrix} = LU$$

correctly identifies the source of the difficulty: relatively small pivots. A
way out of this difficulty is to interchange rows. In our example, if P is the
permutation

$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

then

$$PA = \begin{bmatrix} 1 & 1 \\ .0001 & 1 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ .0001 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 0 & .9999 \end{bmatrix} = LU.$$

Now the triangular factors are comprised of acceptably small elements.

In this section we show how to determine a permuted version of A that 
has a reasonably stable LU factorization. There are several ways to do 
this and they each correspond to a different pivoting strategy. We focus 
on partial pivoting and complete pivoting. The efficient implementation 
of these strategies and their properties are discussed. We begin with a 
discussion of permutation matrix manipulation. 


3.4.1 Permutation Matrices 


The stabilizations of Gaussian elimination that are developed in this sec- 
tion involve data movements such as the interchange of two matrix rows. 
In keeping with our desire to describe all computations in “matrix terms,” 
it is necessary to acquire a familiarity with permutation matrices. A per- 
mutation matrix is just the identity with its rows re-ordered, e.g., 


\[ P = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}. \]


An n-by-n permutation matrix should never be explicitly stored. It is much 
more efficient to represent a general permutation matrix P with an integer 
n-vector p. One way to do this is to let p(k) be the column index of the 
sole "1" in $P$'s kth row. Thus, $p = [4\ 1\ 3\ 2]$ is the appropriate encoding of
the above $P$. It is also possible to encode $P$ on the basis of where the "1"
occurs in each column, e.g., $p = [2\ 4\ 3\ 1]$.

If $P$ is a permutation and $A$ is a matrix, then $PA$ is a row permuted
version of $A$ and $AP$ is a column permuted version of $A$. Permutation
matrices are orthogonal and so if $P$ is a permutation, then $P^{-1} = P^T$. A
product of permutation matrices is a permutation matrix.

In this section we are particularly interested in interchange permutations.
These are permutations obtained by merely swapping two rows in
the identity, e.g.,

\[ E = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}. \]

Interchange permutations can be used to describe row and column swapping.
With the above 4-by-4 example, $EA$ is $A$ with rows 1 and 4 interchanged.
Likewise, $AE$ is $A$ with columns 1 and 4 swapped.

If $P = E_n \cdots E_1$ and each $E_k$ is the identity with rows $k$ and $p(k)$
interchanged, then $p(1:n)$ is a useful vector encoding of $P$. Indeed, $x \in \mathbb{R}^n$
can be overwritten by $Px$ as follows:

    for k = 1:n
        x(k) ↔ x(p(k))
    end

Here, the "↔" notation means "swap contents." Since each $E_k$ is symmetric
and $P^T = E_1 \cdots E_n$, the representation can also be used to overwrite $x$ with
$P^T x$:

    for k = n:-1:1
        x(k) ↔ x(p(k))
    end


It should be noted that no floating point arithmetic is involved in a permutation
operation. However, permutation matrix operations often involve the
irregular movement of data and can represent a significant computational
overhead.
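
The two swap loops of this subsection can be sketched in Python. This is only an illustration under stated assumptions: the names `apply_P` and `apply_PT` are invented here, and indices are 0-based, so `p[k]` holds the 0-based partner of position `k`.

```python
def apply_P(p, x):
    """Overwrite x with P x, where P = E_n ... E_1 and E_k swaps entries k and p[k]."""
    for k in range(len(p)):
        x[k], x[p[k]] = x[p[k]], x[k]
    return x


def apply_PT(p, x):
    """Overwrite x with P^T x by applying the same swaps in reverse order."""
    for k in reversed(range(len(p))):
        x[k], x[p[k]] = x[p[k]], x[k]
    return x


# Hypothetical encoding: swap (0, 2), then (1, 2), then a trivial swap (2, 2).
p = [2, 2, 2]
px = apply_P(p, [10, 20, 30])
back = apply_PT(p, list(px))
```

No floating point work is done: both routines only move data, and `apply_PT` undoes `apply_P` because each interchange is its own inverse.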


3.4.2 Partial Pivoting: The Basic Idea 


We show how interchange permutations can be used in LU computations to 
guarantee that no multiplier is greater than one in absolute value. Suppose 


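\[ A = \begin{bmatrix} 3 & 17 & 10 \\ 2 & 4 & -2 \\ 6 & 18 & -12 \end{bmatrix}. \]

To get the smallest possible multipliers in the first Gauss transform using
row interchanges we need $a_{11}$ to be the largest entry in the first column.
Thus, if $E_1$ is the interchange permutation

\[ E_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \]

then

\[ E_1 A = \begin{bmatrix} 6 & 18 & -12 \\ 2 & 4 & -2 \\ 3 & 17 & 10 \end{bmatrix} \]

and

\[ M_1 = \begin{bmatrix} 1 & 0 & 0 \\ -1/3 & 1 & 0 \\ -1/2 & 0 & 1 \end{bmatrix} \quad \Longrightarrow \quad M_1 E_1 A = \begin{bmatrix} 6 & 18 & -12 \\ 0 & -2 & 2 \\ 0 & 8 & 16 \end{bmatrix}. \]

Now to get the smallest possible multiplier in $M_2$ we need to swap rows 2
and 3. Thus, if

\[ E_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{and} \quad M_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1/4 & 1 \end{bmatrix} \]

then

\[ M_2 E_2 M_1 E_1 A = \begin{bmatrix} 6 & 18 & -12 \\ 0 & 8 & 16 \\ 0 & 0 & 6 \end{bmatrix}. \]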


The example illustrates the basic idea behind the row interchanges. In 
general we have: 


    for k = 1:n-1
        Determine an interchange matrix E_k with E_k(1:k-1, 1:k-1) = I_{k-1}
            such that if z is the kth column of E_k A, then
            |z(k)| = || z(k:n) ||_∞.
        A = E_k A
        Determine the Gauss transform M_k such that if v is the
            kth column of M_k A, then v(k+1:n) = 0.
        A = M_k A
    end

This particular row interchange strategy is called partial pivoting. Upon
completion we emerge with $M_{n-1}E_{n-1} \cdots M_1 E_1 A = U$, an upper triangular
matrix.

As a consequence of the partial pivoting, no multiplier is larger than
one in absolute value. This is because

\[ \left| (E_k M_{k-1} \cdots M_1 E_1 A)_{kk} \right| = \max_{k \le i \le n} \left| (E_k M_{k-1} \cdots M_1 E_1 A)_{ik} \right| \]

for $k = 1:n-1$. Thus, partial pivoting effectively guards against arbitrarily
large multipliers.


3.4.3 Partial Pivoting Details


We are now set to detail the overall Gaussian Elimination with partial piv- 
oting algorithm. 


Algorithm 3.4.1 (Gauss Elimination with Partial Pivoting) If
$A \in \mathbb{R}^{n \times n}$, then this algorithm computes Gauss transforms $M_1, \ldots, M_{n-1}$
and interchange permutations $E_1, \ldots, E_{n-1}$ such that $M_{n-1}E_{n-1} \cdots M_1 E_1 A
= U$ is upper triangular. No multiplier is bigger than 1 in absolute value.
$A(1:k,k)$ is overwritten by $U(1:k,k)$, $k = 1:n$. $A(k+1:n,k)$ is overwritten
by $-M_k(k+1:n,k)$, $k = 1:n-1$. The integer vector $p(1:n-1)$ defines
the interchange permutations. In particular, $E_k$ interchanges rows $k$ and
$p(k)$, $k = 1:n-1$.

    for k = 1:n-1
        Determine μ with k ≤ μ ≤ n so |A(μ,k)| = || A(k:n,k) ||_∞
        A(k,k:n) ↔ A(μ,k:n)
        p(k) = μ
        if A(k,k) ≠ 0
            rows = k+1:n
            A(rows,k) = A(rows,k)/A(k,k)
            A(rows,rows) = A(rows,rows) - A(rows,k)A(k,rows)
        end
    end

Note that if $\| A(k:n,k) \|_\infty = 0$ in step $k$, then in exact arithmetic the first
$k$ columns of $A$ are linearly dependent. In contrast to Algorithm 3.2.1, this
poses no difficulty. We merely skip over the zero pivot.

The overhead associated with partial pivoting is minimal from the standpoint
of floating point arithmetic as there are only $O(n^2)$ comparisons associated
with the search for the pivots. The overall algorithm involves $2n^3/3$
flops.

To solve the linear system $Ax = b$ after invoking Algorithm 3.4.1 we

* Compute $y = M_{n-1}E_{n-1} \cdots M_1 E_1 b$.
* Solve the upper triangular system $Ux = y$.

All the information necessary to do this is contained in the array $A$ and the
pivot vector $p$. Indeed, the calculation

    for k = 1:n-1
        b(k) ↔ b(p(k))
        b(k+1:n) = b(k+1:n) - b(k)A(k+1:n,k)
    end

overwrites $b$ with $M_{n-1}E_{n-1} \cdots M_1 E_1 b$.

Example 3.4.1 If Algorithm 3.4.1 is applied to

\[ A = \begin{bmatrix} 3 & 17 & 10 \\ 2 & 4 & -2 \\ 6 & 18 & -12 \end{bmatrix}, \]

then upon exit

\[ A = \begin{bmatrix} 6 & 18 & -12 \\ 1/3 & 8 & 16 \\ 1/2 & -1/4 & 6 \end{bmatrix} \]

and $p = [3, 3]$. These two quantities encode all the information associated with the
reduction:

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1/4 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ -1/3 & 1 & 0 \\ -1/2 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} A = \begin{bmatrix} 6 & 18 & -12 \\ 0 & 8 & 16 \\ 0 & 0 & 6 \end{bmatrix}. \]
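
As a hedged illustration, Algorithm 3.4.1 and the follow-up solve can be sketched in Python. The function names `ge_partial_pivoting` and `solve_after_ge` are invented for this sketch, indices are 0-based (so the pivot vector of Example 3.4.1 appears as `[2, 2]` rather than `[3, 3]`), and `fractions.Fraction` is used so that the stored multipliers are exact.

```python
from fractions import Fraction


def ge_partial_pivoting(A):
    """Overwrite A with U (upper part) and the multipliers (strictly lower part).

    Returns the 0-based pivot vector p; step k interchanged rows k and p[k].
    """
    n = len(A)
    p = []
    for k in range(n - 1):
        mu = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot search
        A[k][k:], A[mu][k:] = A[mu][k:], A[k][k:]            # swap A(k,k:n), A(mu,k:n)
        p.append(mu)
        if A[k][k] != 0:
            for i in range(k + 1, n):
                A[i][k] = A[i][k] / A[k][k]
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return p


def solve_after_ge(A, p, b):
    """Apply the stored interchanges and multipliers to b, then back substitute."""
    n = len(A)
    for k in range(n - 1):
        b[k], b[p[k]] = b[p[k]], b[k]
        for i in range(k + 1, n):
            b[i] -= b[k] * A[i][k]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


A = [[Fraction(v) for v in row] for row in [[3, 17, 10], [2, 4, -2], [6, 18, -12]]]
p = ge_partial_pivoting(A)
x = solve_after_ge(A, p, [30, 4, 12])   # right-hand side chosen so that x = (1, 1, 1)
```

Because only `A(k,k:n)` is swapped, the multipliers stored in earlier columns stay where they were computed, which is why the solve must interleave swaps and updates.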


3.4.4. Where is L? 


Gaussian elimination with partial pivoting computes the LU factorization of 
a row permuted version of A. The proof is a messy subscripting argument. 


Theorem 3.4.1 If Gaussian elimination with partial pivoting is used to
compute the upper triangularization

\[ M_{n-1}E_{n-1} \cdots M_1 E_1 A = U \qquad (3.4.1) \]

via Algorithm 3.4.1, then

\[ PA = LU \]

where $P = E_{n-1} \cdots E_1$ and $L$ is a unit lower triangular matrix with $|l_{ij}| \le 1$.
The kth column of $L$ below the diagonal is a permuted version of the
kth Gauss vector. In particular, if $M_k = I - \tau^{(k)} e_k^T$, then $L(k+1:n,k) =
g(k+1:n)$ where $g = E_{n-1} \cdots E_{k+1} \tau^{(k)}$.

Proof. A manipulation of (3.4.1) reveals that $\tilde{M}_{n-1} \cdots \tilde{M}_1 PA = U$ where
$\tilde{M}_{n-1} = M_{n-1}$ and

\[ \tilde{M}_k = E_{n-1} \cdots E_{k+1} M_k E_{k+1} \cdots E_{n-1}, \qquad k \le n-2. \]

Since each $E_j$ is an interchange permutation involving row $j$ and a row $\mu$
with $\mu \ge j$ we have $E_j(1:j-1, 1:j-1) = I_{j-1}$. It follows that each $\tilde{M}_k$ is
a Gauss transform with Gauss vector $\tilde{\tau}^{(k)} = E_{n-1} \cdots E_{k+1} \tau^{(k)}$. □


As a consequence of the theorem, it is easy to see how to change Algorithm
3.4.1 so that upon completion, $A(i,j)$ houses $L(i,j)$ for all $i > j$. We
merely apply each $E_k$ to all the previously computed Gauss vectors. This
is accomplished by changing the line "A(k,k:n) ↔ A(μ,k:n)" in Algorithm
3.4.1 to "A(k,1:n) ↔ A(μ,1:n)."


Example 3.4.2 The factorization $PA = LU$ of the matrix in Example 3.4.1 is given by

\[ \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 3 & 17 & 10 \\ 2 & 4 & -2 \\ 6 & 18 & -12 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 1/3 & -1/4 & 1 \end{bmatrix} \begin{bmatrix} 6 & 18 & -12 \\ 0 & 8 & 16 \\ 0 & 0 & 6 \end{bmatrix}. \]
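
The full-row-swap variant can be checked with a short Python sketch. The helper name `lu_pa` and the list `perm` (row `i` of $PA$ is row `perm[i]` of $A$, 0-based) are assumptions of this sketch, not notation from the text; exact rational arithmetic is used so the comparison $PA = LU$ is exact.

```python
from fractions import Fraction


def lu_pa(A):
    """Return (perm, L, U) with row perm[i] of A equal to row i of L*U."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        mu = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[mu] = A[mu], A[k]            # swap entire rows: A(k,1:n) <-> A(mu,1:n)
        perm[k], perm[mu] = perm[mu], perm[k]
        if A[k][k] != 0:
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    L = [[A[i][j] if j < i else Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j >= i else Fraction(0) for j in range(n)] for i in range(n)]
    return perm, L, U


A0 = [[3, 17, 10], [2, 4, -2], [6, 18, -12]]
perm, L, U = lu_pa(A0)
LU = [[sum(L[i][k] * U[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
PA = [A0[perm[i]] for i in range(3)]
```

Swapping whole rows moves the previously computed multipliers along with the rest of the row, so the strictly lower triangle ends up holding $L$ itself.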


3.4.5 The Gaxpy Version 


In §3.2 we developed outer product and gaxpy schemes for computing the
LU factorization. Having just incorporated pivoting in the outer product
version, it is natural to do the same with the gaxpy approach. Recall from
(3.2.5) the general structure of the gaxpy LU process:

    L = I
    U = 0
    for j = 1:n
        if j = 1
            v(j:n) = A(j:n,j)
        else
            Solve L(1:j-1,1:j-1)z = A(1:j-1,j) for z
                and set U(1:j-1,j) = z.
            v(j:n) = A(j:n,j) - L(j:n,1:j-1)z
        end
        if j < n
            L(j+1:n,j) = v(j+1:n)/v(j)
        end
        U(j,j) = v(j)
    end


With partial pivoting we search $|v(j:n)|$ for its maximal element and proceed
accordingly. Assuming $A$ is nonsingular so no zero pivots are encountered
we obtain

    L = I; U = 0
    for j = 1:n
        if j = 1
            v(j:n) = A(j:n,j)
        else
            Solve L(1:j-1,1:j-1)z = A(1:j-1,j)
                for z and set U(1:j-1,j) = z.
            v(j:n) = A(j:n,j) - L(j:n,1:j-1)z
        end                                                  (3.4.2)
        if j < n
            Determine μ with j ≤ μ ≤ n so |v(μ)| = || v(j:n) ||_∞.
            p(j) = μ
            v(j) ↔ v(μ)
            A(j,j+1:n) ↔ A(μ,j+1:n)
            L(j+1:n,j) = v(j+1:n)/v(j)
            if j > 1
                L(j,1:j-1) ↔ L(μ,1:j-1)
            end
        end
        U(j,j) = v(j)
    end

In this implementation, we emerge with the factorization $PA = LU$ where
$P = E_{n-1} \cdots E_1$ where $E_k$ is obtained by interchanging rows $k$ and $p(k)$ of
the n-by-n identity. As with Algorithm 3.4.1, this procedure requires $2n^3/3$
flops and $O(n^2)$ comparisons.


3.4.6 Error Analysis 

We now examine the stability that is obtained with partial pivoting. This
requires an accounting of the rounding errors that are sustained during
elimination and during the triangular system solving. Bearing in mind
that there are no rounding errors associated with permutation, it is not
hard to show using Theorem 3.3.2 that the computed solution $\hat{x}$ satisfies
$(A + E)\hat{x} = b$ where

\[ |E| \le nu \left( 3|A| + 5 \hat{P}^T |\hat{L}| |\hat{U}| \right) + O(u^2). \qquad (3.4.3) \]

Here we are assuming that $\hat{P}$, $\hat{L}$, and $\hat{U}$ are the computed analogs of $P$,
$L$, and $U$ as produced by the above algorithms. Pivoting implies that the
elements of $\hat{L}$ are bounded by one. Thus $\| \hat{L} \|_\infty \le n$ and we obtain the
bound

\[ \| E \|_\infty \le nu \left( 3 \| A \|_\infty + 5n \| \hat{U} \|_\infty \right) + O(u^2). \qquad (3.4.4) \]

The problem now is to bound $\| \hat{U} \|_\infty$. Define the growth factor $\rho$ by

\[ \rho = \max_{i,j,k} \frac{|\hat{a}_{ij}^{(k)}|}{\| A \|_\infty} \qquad (3.4.5) \]

where $\hat{A}^{(k)}$ is the computed version of the matrix $A^{(k)} = M_k E_k \cdots M_1 E_1 A$.
It follows that

\[ \| E \|_\infty \le 8 n^3 \rho \| A \|_\infty u + O(u^2). \qquad (3.4.6) \]

Whether or not this compares favorably with the ideal bound (3.3.1) hinges
upon the size of the growth factor $\rho$. (The factor $n^3$ is not an operating
factor in practice and may be ignored in this discussion.) The growth factor
measures how large the numbers become during the process of elimination.
In practice, $\rho$ is usually of order 10 but it can also be as large as $2^{n-1}$.
Despite this, most numerical analysts regard the occurrence of serious element
growth in Gaussian elimination with partial pivoting as highly unlikely in
practice. The method can be used with confidence.


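Example 3.4.3 If Gaussian elimination with partial pivoting is applied to the problem

\[ \begin{bmatrix} .001 & 1.00 \\ 1.00 & 2.00 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.00 \\ 3.00 \end{bmatrix} \]

with $\beta = 10$, $t = 3$ floating point arithmetic, then

\[ \hat{P} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \hat{L} = \begin{bmatrix} 1.00 & 0 \\ .001 & 1.00 \end{bmatrix}, \qquad \hat{U} = \begin{bmatrix} 1.00 & 2.00 \\ 0 & 1.00 \end{bmatrix} \]

and $\hat{x} = (1.00, .996)^T$. Compare with Example 3.3.1.


Example 3.4.4 If $A \in \mathbb{R}^{n \times n}$ is defined by

\[ a_{ij} = \begin{cases} 1 & \text{if } i = j \text{ or } j = n \\ -1 & \text{if } i > j \\ 0 & \text{otherwise} \end{cases} \]

then $A$ has an LU factorization with $|l_{ij}| \le 1$ and $u_{nn} = 2^{n-1}$.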
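
The worst-case matrix with $u_{nn} = 2^{n-1}$ is easy to check numerically. The following Python sketch is an illustration only: the helper names `growth_matrix` and `partial_pivot_lu_inplace` are made up here, indices are 0-based, and ties in the pivot search are resolved in favor of the first maximal entry, so no rows are actually interchanged.

```python
from fractions import Fraction


def growth_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal, 0 elsewhere."""
    return [[Fraction(1) if (i == j or j == n - 1) else Fraction(-1 if i > j else 0)
             for j in range(n)] for i in range(n)]


def partial_pivot_lu_inplace(A):
    """Outer product elimination with partial pivoting; L and U are packed into A."""
    n = len(A)
    for k in range(n - 1):
        mu = max(range(k, n), key=lambda i: abs(A[i][k]))   # first maximal entry
        A[k], A[mu] = A[mu], A[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A


n = 6
LU = partial_pivot_lu_inplace(growth_matrix(n))
u_nn = LU[n - 1][n - 1]
max_multiplier = max(abs(LU[i][j]) for i in range(n) for j in range(i))
```

Every multiplier has absolute value one, yet the last column doubles at each step, which is exactly the $2^{n-1}$ growth that the bound allows.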


3.4.7 Block Gaussian Elimination 
Gaussian Elimination with partial pivoting can be organized so that it is 
rich in level-3 operations. We detail a block outer product procedure but 
block gaxpy and block dot product formulations are also possible. See 
Dayde and Duff (1988). 

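Assume $A \in \mathbb{R}^{n \times n}$ and for clarity that $n = rN$. Partition $A$ as follows:

\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \]

where $A_{11}$ is r-by-r and $A_{22}$ is (n-r)-by-(n-r).
The first step in the block reduction is typical and proceeds as follows:

* Use scalar Gaussian elimination with partial pivoting (e.g. a rectangular
  version of Algorithm 3.4.1) to compute permutation $P_1 \in \mathbb{R}^{n \times n}$,
  unit lower triangular $L_{11} \in \mathbb{R}^{r \times r}$ and upper triangular $U_{11} \in \mathbb{R}^{r \times r}$ so

  \[ P_1 \begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} = \begin{bmatrix} L_{11} \\ L_{21} \end{bmatrix} U_{11}. \]

* Apply the $P_1$ across the rest of $A$:

  \[ \begin{bmatrix} \tilde{A}_{12} \\ \tilde{A}_{22} \end{bmatrix} = P_1 \begin{bmatrix} A_{12} \\ A_{22} \end{bmatrix}. \]

* Solve the lower triangular multiple right hand side problem

  \[ L_{11} U_{12} = \tilde{A}_{12}. \]

* Perform the level-3 update

  \[ \tilde{A} = \tilde{A}_{22} - L_{21} U_{12}. \]

With these computations we obtain the factorization

\[ P_1 A = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} I_r & 0 \\ 0 & \tilde{A} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & I_{n-r} \end{bmatrix}. \]

The process is then repeated on the first $r$ columns of $\tilde{A}$.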

In general, during step $k$ ($1 \le k \le N-1$) of the block algorithm we
apply scalar Gaussian elimination to a matrix of size $(n - (k-1)r)$-by-$r$.
An $r$-by-$(n - kr)$ multiple right hand side system is solved and a level-3
update of size $(n - kr)$-by-$(n - kr)$ is performed. The level-3 fraction for
the overall process is approximately given by $1 - 3/(2N)$. Thus, for large
$N$ the procedure is rich in matrix multiplication.


3.4.8 Complete Pivoting 


Another pivot strategy called complete pivoting has the property that the
associated growth factor bound is considerably smaller than $2^{n-1}$. Recall
that in partial pivoting, the kth pivot is determined by scanning the current
subcolumn $A(k:n,k)$. In complete pivoting, the largest entry in the current
submatrix $A(k:n,k:n)$ is permuted into the $(k,k)$ position. Thus, we
compute the upper triangularization $M_{n-1}E_{n-1} \cdots M_1 E_1 A F_1 \cdots F_{n-1} = U$
with the property that in step $k$ we are confronted with the matrix

\[ A^{(k-1)} = M_{k-1}E_{k-1} \cdots M_1 E_1 A F_1 \cdots F_{k-1} \]

and determine interchange permutations $E_k$ and $F_k$ such that

\[ \left| \left( E_k A^{(k-1)} F_k \right)_{kk} \right| = \max_{k \le i,j \le n} \left| \left( E_k A^{(k-1)} F_k \right)_{ij} \right|. \]

We have the analog of Theorem 3.4.1 


Theorem 3.4.2 If Gaussian elimination with complete pivoting is used to
compute the upper triangularization

\[ M_{n-1}E_{n-1} \cdots M_1 E_1 A F_1 \cdots F_{n-1} = U \qquad (3.4.7) \]

then

\[ PAQ = LU \]

where $P = E_{n-1} \cdots E_1$, $Q = F_1 \cdots F_{n-1}$ and $L$ is a unit lower triangular
matrix with $|l_{ij}| \le 1$. The kth column of $L$ below the diagonal is a permuted
version of the kth Gauss vector. In particular, if $M_k = I - \tau^{(k)} e_k^T$ then
$L(k+1:n,k) = g(k+1:n)$ where $g = E_{n-1} \cdots E_{k+1} \tau^{(k)}$.


Proof. The proof is similar to the proof of Theorem 3.4.1. Details are left
to the reader. □


Here is Gaussian elimination with complete pivoting in detail: 


Algorithm 3.4.2 (Gaussian Elimination with Complete Pivoting)
This algorithm computes the complete pivoting factorization $PAQ = LU$
where $L$ is unit lower triangular and $U$ is upper triangular. $P = E_{n-1} \cdots E_1$
and $Q = F_1 \cdots F_{n-1}$ are products of interchange permutations. $A(1:k,k)$
is overwritten by $U(1:k,k)$, $k = 1:n$. $A(k+1:n,k)$ is overwritten by
$L(k+1:n,k)$, $k = 1:n-1$. $E_k$ interchanges rows $k$ and $p(k)$. $F_k$ interchanges
columns $k$ and $q(k)$.

    for k = 1:n-1
        Determine μ with k ≤ μ ≤ n and λ with k ≤ λ ≤ n so
            |A(μ,λ)| = max{ |A(i,j)| : i = k:n, j = k:n }
        A(k,1:n) ↔ A(μ,1:n)
        A(1:n,k) ↔ A(1:n,λ)
        p(k) = μ
        q(k) = λ
        if A(k,k) ≠ 0
            rows = k+1:n
            A(rows,k) = A(rows,k)/A(k,k)
            A(rows,rows) = A(rows,rows) - A(rows,k)A(k,rows)
        end
    end

This algorithm requires $2n^3/3$ flops and $O(n^3)$ comparisons. Unlike partial
pivoting, complete pivoting involves a significant overhead because of the
two-dimensional search at each stage.
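
Algorithm 3.4.2 can be sketched in Python along the same lines. The function name `ge_complete_pivoting` is hypothetical, indices are 0-based, and exact rational arithmetic is an assumption made only so the printed factors are easy to verify by hand.

```python
from fractions import Fraction


def ge_complete_pivoting(A):
    """Return (LU, p, q): packed factors of P*A*Q and 0-based interchange vectors."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    p, q = [], []
    for k in range(n - 1):
        # Two-dimensional search for the largest entry of A(k:n, k:n).
        mu, lam = max(((i, j) for i in range(k, n) for j in range(k, n)),
                      key=lambda ij: abs(A[ij[0]][ij[1]]))
        A[k], A[mu] = A[mu], A[k]                 # swap rows k and mu
        for row in A:                             # swap columns k and lam
            row[k], row[lam] = row[lam], row[k]
        p.append(mu)
        q.append(lam)
        if A[k][k] != 0:
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return A, p, q


LU, p, q = ge_complete_pivoting([[3, 17, 10], [2, 4, -2], [6, 18, -12]])
```

For this 3-by-3 matrix the first pivot is the entry 18, and all stored multipliers are at most one in absolute value. The nested search over `(i, j)` is what makes the comparison count cubic.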


3.4.9 Comments on Complete Pivoting 


Suppose $\mathrm{rank}(A) = r < n$. It follows that at the beginning of step $r+1$,
$A(r+1:n, r+1:n) = 0$. This implies that $E_k = F_k = M_k = I$ for $k = r+1:n$
and so the algorithm can be terminated after step $r$ with the following
factorization in hand:

\[ PAQ = LU = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & 0 \end{bmatrix}. \]

Here $L_{11}$ and $U_{11}$ are r-by-r and $L_{21}$ and $U_{12}^T$ are (n-r)-by-r. Thus,
Gaussian elimination with complete pivoting can in principle be used to
determine the rank of a matrix. Yet roundoff errors make the probability
of encountering an exactly zero pivot remote. In practice one would have to
"declare" $A$ to have rank $k$ if the pivot element in step $k+1$ was sufficiently
small. The numerical rank determination problem is discussed in detail in
§5.4.

Wilkinson (1961) has shown that in exact arithmetic the elements of
the matrix $A^{(k)} = M_k E_k \cdots M_1 E_1 A F_1 \cdots F_k$ satisfy

\[ |a_{ij}^{(k)}| \le k^{1/2} \left( 2 \cdot 3^{1/2} \cdots k^{1/(k-1)} \right)^{1/2} \max |a_{ij}|. \qquad (3.4.8) \]

The upper bound is a rather slow-growing function of $k$. This fact coupled
with vast empirical evidence suggesting that $\rho$ is always modestly sized (e.g.,
$\rho = 10$) permits us to conclude that Gaussian elimination with complete
pivoting is stable. The method solves a nearby linear system $(A + E)\hat{x} = b$
exactly in the sense of (3.3.1). However, there appears to be no practical
justification for choosing complete pivoting over partial pivoting except in
cases where rank determination is an issue.


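Example 3.4.5 If Gaussian elimination with complete pivoting is applied to the problem

\[ \begin{bmatrix} .001 & 1.00 \\ 1.00 & 2.00 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.00 \\ 3.00 \end{bmatrix} \]

with $\beta = 10$, $t = 3$ floating point arithmetic, then

\[ \hat{P} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \hat{Q} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \hat{L} = \begin{bmatrix} 1.00 & 0.00 \\ .500 & 1.00 \end{bmatrix}, \quad \hat{U} = \begin{bmatrix} 2.00 & 1.00 \\ 0.00 & -.499 \end{bmatrix} \]

and $\hat{x} = [1.00, 1.00]^T$. Compare with Examples 3.3.1 and 3.4.3.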


3.4.10 The Avoidance of Pivoting 


For certain classes of matrices it is not necessary to pivot. It is important
to identify such classes because pivoting usually degrades performance. To
illustrate the kind of analysis required to prove that pivoting can be safely
avoided, we consider the case of diagonally dominant matrices. We say that
$A \in \mathbb{R}^{n \times n}$ is strictly diagonally dominant if

\[ |a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \qquad i = 1:n. \]

The following theorem shows how this property can ensure a nice, no-pivoting
LU factorization.


Theorem 3.4.3 If $A^T$ is strictly diagonally dominant, then $A$ has an LU
factorization and $|l_{ij}| \le 1$. In other words, if Algorithm 3.4.1 is applied,
then $P = I$.

Proof. Partition $A$ as follows

\[ A = \begin{bmatrix} \alpha & w^T \\ v & C \end{bmatrix} \]

where $\alpha$ is 1-by-1 and note that after one step of the outer product LU
process we have the factorization

\[ \begin{bmatrix} \alpha & w^T \\ v & C \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ v/\alpha & I \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & C - vw^T/\alpha \end{bmatrix} \begin{bmatrix} \alpha & w^T \\ 0 & I \end{bmatrix}. \]

The theorem follows by induction on $n$ if we can show that the transpose
of $B = C - vw^T/\alpha$ is strictly diagonally dominant. This is because we may
then assume that $B$ has an LU factorization $B = L_1 U_1$ and that implies

\[ A = \begin{bmatrix} 1 & 0 \\ v/\alpha & L_1 \end{bmatrix} \begin{bmatrix} \alpha & w^T \\ 0 & U_1 \end{bmatrix} \equiv LU. \]

But the proof that $B^T$ is strictly diagonally dominant is straightforward.
From the definitions we have

\[ \begin{aligned}
\sum_{\substack{i=1 \\ i \ne j}}^{n-1} |b_{ij}| &= \sum_{\substack{i=1 \\ i \ne j}}^{n-1} |c_{ij} - v_i w_j/\alpha| \le \sum_{\substack{i=1 \\ i \ne j}}^{n-1} |c_{ij}| + \frac{|w_j|}{|\alpha|} \sum_{\substack{i=1 \\ i \ne j}}^{n-1} |v_i| \\
&< (|c_{jj}| - |w_j|) + \frac{|w_j|}{|\alpha|} (|\alpha| - |v_j|) \\
&\le \left| c_{jj} - \frac{w_j v_j}{\alpha} \right| = |b_{jj}|. \quad \Box
\end{aligned} \]


3.4.11 Some Applications


We conclude with some examples that illustrate how to think in terms of 
matrix factorizations when confronted with various linear equation situa- 
tions. 

Suppose $A$ is nonsingular and n-by-n and that $B$ is n-by-p. Consider the
problem of finding $X$ (n-by-p) so $AX = B$, i.e., the multiple right hand side
problem. If $X = [x_1, \ldots, x_p]$ and $B = [b_1, \ldots, b_p]$ are column partitions,
then

    Compute PA = LU.
    for k = 1:p
        Solve Ly = P b_k                                     (3.4.9)
        Solve U x_k = y
    end

Note that $A$ is factored just once. If $B = I_n$, then we emerge with a
computed $A^{-1}$.

As another example of getting the LU factorization "outside the loop,"
suppose we want to solve the linear system $A^k x = b$ where $A \in \mathbb{R}^{n \times n}$,
$b \in \mathbb{R}^n$, and $k$ is a positive integer. One approach is to compute $C = A^k$
and then solve $Cx = b$. However, the matrix multiplications can be avoided
altogether:

    Compute PA = LU
    for j = 1:k
        Overwrite b with the solution to Ly = Pb.          (3.4.10)
        Overwrite b with the solution to Ux = b.
    end


As a final example we show how to avoid the pitfall of explicit inverse
computation. Suppose we are given $A \in \mathbb{R}^{n \times n}$, $d \in \mathbb{R}^n$, and $c \in \mathbb{R}^n$ and
that we want to compute $s = c^T A^{-1} d$. One approach is to compute $X =
A^{-1}$ as suggested above and then compute $s = c^T X d$. A more economical
procedure is to compute $PA = LU$ and then solve the triangular systems
$Ly = Pd$ and $Ux = y$. It follows that $s = c^T x$. The point of this example is
to stress that when a matrix inverse is encountered in a formula, we must
think in terms of solving equations rather than in terms of explicit inverse
formation.
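
A small Python sketch of (3.4.10) and of the $c^T A^{-1} d$ computation follows. The helpers `lu_pa`, `lu_solve`, `solve_power`, and `bilinear_inverse` are names invented for this illustration, indices are 0-based, and rational arithmetic is assumed so that the results are exact.

```python
from fractions import Fraction


def lu_pa(A):
    """Packed LU factors of P*A plus perm, where row i of P*A is row perm[i] of A."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        mu = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[mu] = A[mu], A[k]
        perm[k], perm[mu] = perm[mu], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, perm


def lu_solve(LU, perm, b):
    """Solve Ly = Pb by forward substitution, then Ux = y by back substitution."""
    n = len(LU)
    y = [Fraction(b[perm[i]]) for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def solve_power(A, b, k):
    """Solve A^k x = b with one factorization and k pairs of triangular solves."""
    LU, perm = lu_pa(A)
    for _ in range(k):
        b = lu_solve(LU, perm, b)
    return b


def bilinear_inverse(A, c, d):
    """Compute s = c^T A^{-1} d without forming the inverse."""
    LU, perm = lu_pa(A)
    x = lu_solve(LU, perm, d)
    return sum(ci * xi for ci, xi in zip(c, x))


A = [[2, 1], [1, 3]]
x = solve_power(A, [10, 15], 2)          # A^2 (1, 1)^T = (10, 15)^T
s = bilinear_inverse(A, [1, 2], [3, 4])  # A^{-1} (3, 4)^T = (1, 1)^T
```

In both routines the $O(n^3)$ factorization happens once, and each additional right-hand side costs only $O(n^2)$.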


Problems

P3.4.1 Let $A = LU$ be the LU factorization of n-by-n $A$ with $|l_{ij}| \le 1$. Let $a_i^T$ and $u_i^T$
denote the ith rows of $A$ and $U$, respectively. Verify the equation

\[ u_i^T = a_i^T - \sum_{j=1}^{i-1} l_{ij} u_j^T \]

and use it to show that $\| U \|_\infty \le 2^{n-1} \| A \|_\infty$. (Hint: Take norms and use induction.)

P3.4.2 Show that if $PAQ = LU$ is obtained via Gaussian elimination with complete
pivoting, then no element of $U(i, i:n)$ is larger in absolute value than $|u_{ii}|$.

P3.4.3 Suppose $A \in \mathbb{R}^{n \times n}$ has an LU factorization and that $L$ and $U$ are known. Give
an algorithm which can compute the $(i,j)$ entry of $A^{-1}$ in approximately $(n-j)^2 + (n-i)^2$
flops.

P3.4.4 Suppose $\hat{X}$ is the computed inverse obtained via (3.4.9). Give an upper bound
for $\| A\hat{X} - I \|_F$.

P3.4.5 Prove Theorem 3.4.2.

P3.4.6 Extend Algorithm 3.4.3 so that it can factor an arbitrary rectangular matrix.

P3.4.7 Write a detailed version of the block elimination algorithm outlined in §3.4.7.


3.5 Improving and Estimating Accuracy


Suppose Gaussian elimination with partial pivoting is used to solve the n-by-n
system $Ax = b$. Assume t-digit, base $\beta$ floating point arithmetic is
used. Equation (3.4.6) essentially says that if the growth factor is modest
then the computed solution $\hat{x}$ satisfies

\[ (A + E)\hat{x} = b, \qquad \| E \|_\infty \approx u \| A \|_\infty, \qquad u = \tfrac{1}{2} \beta^{-t}. \qquad (3.5.1) \]

In this section we explore the practical ramifications of this result. We begin
by stressing the distinction that should be made between residual size and
accuracy. This is followed by a discussion of scaling, iterative improvement,
and condition estimation. See Higham (1996) for a more detailed treatment
of these topics.

We make two notational remarks at the outset. The infinity norm is used
throughout since it is very handy in roundoff error analysis and in practical
error estimation. Second, whenever we refer to "Gaussian elimination" in
this section we really mean Gaussian elimination with some stabilizing pivot
strategy such as partial pivoting.


3.5.1 Residual Size Versus Accuracy 


The residual of a computed solution $\hat{x}$ to the linear system $Ax = b$ is the
vector $b - A\hat{x}$. A small residual means that $A\hat{x}$ effectively "predicts" the
right hand side $b$. From (3.5.1) we have $\| b - A\hat{x} \|_\infty \approx u \| A \|_\infty \| \hat{x} \|_\infty$
and so we obtain


Heuristic I. Gaussian elimination produces a solution $\hat{x}$ with a relatively
small residual.

Small residuals do not imply high accuracy. Combining (3.3.2) and (3.5.1),
we see that

\[ \frac{\| \hat{x} - x \|_\infty}{\| x \|_\infty} \approx u \kappa_\infty(A). \qquad (3.5.2) \]

This justifies a second guiding principle.


Heuristic II. If the unit roundoff and condition satisfy $u \approx 10^{-d}$ and
$\kappa_\infty(A) \approx 10^q$, then Gaussian elimination produces a solution $\hat{x}$ that
has about $d - q$ correct decimal digits.

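If $u \kappa_\infty(A)$ is large, then we say that $A$ is ill-conditioned with respect to
the machine precision.

As an illustration of Heuristics I and II, consider the system

\[ \begin{bmatrix} .986 & .579 \\ .409 & .237 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} .235 \\ .107 \end{bmatrix} \]

in which $\kappa_\infty(A) \approx 700$ and $x = (2, -3)^T$. Here is what we find for various
machine precisions:

    $\| \hat{x} - x \|_\infty / \| x \|_\infty$    |    $\| b - A\hat{x} \|_\infty / ( \| A \|_\infty \| \hat{x} \|_\infty )$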

Whether or not one is content with the computed solution $\hat{x}$ depends on
the requirements of the underlying source problem. In many applications
accuracy is not important but small residuals are. In such a situation, the
$\hat{x}$ produced by Gaussian elimination is probably adequate. On the other
hand, if the number of correct digits in $\hat{x}$ is an issue then the situation
is more complicated and the discussion in the remainder of this section is
relevant.


3.5.2 Scaling

Let $\beta$ be the machine base and define the diagonal matrices $D_1$ and $D_2$ by

\[ D_1 = \mathrm{diag}(\beta^{r_1}, \ldots, \beta^{r_n}) \]
\[ D_2 = \mathrm{diag}(\beta^{c_1}, \ldots, \beta^{c_n}). \]

The solution to the n-by-n linear system $Ax = b$ can be found by solving
the scaled system $(D_1^{-1} A D_2) y = D_1^{-1} b$ using Gaussian elimination and
then setting $x = D_2 y$. The scalings of $A$, $b$, and $y$ require only $O(n^2)$ flops
and may be accomplished without roundoff. Note that $D_1$ scales equations
and $D_2$ scales unknowns.

It follows from Heuristic II that if $\hat{x}$ and $\hat{y}$ are the computed versions of
$x$ and $y$, then

\[ \frac{\| D_2^{-1} (\hat{x} - x) \|_\infty}{\| D_2^{-1} x \|_\infty} = \frac{\| \hat{y} - y \|_\infty}{\| y \|_\infty} \approx u \kappa_\infty(D_1^{-1} A D_2). \qquad (3.5.3) \]

Thus, if $\kappa_\infty(D_1^{-1} A D_2)$ can be made considerably smaller than $\kappa_\infty(A)$, then
we might expect a correspondingly more accurate $\hat{x}$, provided errors are
measured in the "$D_2$" norm defined by $\| z \|_{D_2} = \| D_2^{-1} z \|_\infty$. This is the
objective of scaling. Note that it encompasses two issues: the condition
of the scaled problem and the appropriateness of appraising error in the
$D_2$-norm.

An interesting but very difficult mathematical problem concerns the
exact minimization of $\kappa_p(D_1^{-1} A D_2)$ for general diagonal $D_i$ and various
$p$. What results there are in this direction are not very practical. This is
hardly discouraging, however, when we recall that (3.5.3) is heuristic and
it makes little sense to minimize exactly a heuristic bound. What we seek
is a fast, approximate method for improving the quality of the computed
solution $\hat{x}$.

One technique of this variety is simple row scaling. In this scheme $D_2$ is
the identity and $D_1$ is chosen so that each row in $D_1^{-1} A$ has approximately
the same $\infty$-norm. Row scaling reduces the likelihood of adding a very
small number to a very large number during elimination, an event that
can greatly diminish accuracy.

Slightly more complicated than simple row scaling is row-column equilibration.
Here, the object is to choose $D_1$ and $D_2$ so that the $\infty$-norm
of each row and column of $D_1^{-1} A D_2$ belongs to the interval $[1/\beta, 1]$ where
$\beta$ is the base of the floating point system. For work along these lines see
McKeeman (1962).

It cannot be stressed too much that simple row scaling and row-column
equilibration do not "solve" the scaling problem. Indeed, either technique
can render a worse $\hat{x}$ than if no scaling whatever is used. The ramifications
of this point are thoroughly discussed in Forsythe and Moler (1967, chapter
11). The basic recommendation is that the scaling of equations and
unknowns must proceed on a problem-by-problem basis. General scaling
strategies are unreliable. It is best to scale (if at all) on the basis of what the
source problem proclaims about the significance of each $a_{ij}$. Measurement
units and data error may have to be considered.


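Example 3.5.1 (Forsythe and Moler (1967, pp. 34, 40)). If

\[ \begin{bmatrix} 10 & 100{,}000 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 100{,}000 \\ 2 \end{bmatrix} \]

and the equivalent row-scaled problem

\[ \begin{bmatrix} .0001 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \]

are each solved using $\beta = 10$, $t = 3$ arithmetic, then solutions $\hat{x} = (0.00, 1.00)^T$ and
$\hat{x} = (1.00, 1.00)^T$ are respectively computed. Note that $x = (1.0001\ldots, .9999\ldots)^T$ is
the exact solution.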
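
The effect of row scaling in this example can be reproduced with a short Python sketch that simulates $\beta = 10$, $t = 3$ arithmetic using the standard `decimal` module. The function name `solve2_pp` is invented here, the routine handles only 2-by-2 systems, and rounding every individual operation to $t$ digits with round-half-even is an assumption about the arithmetic model.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def solve2_pp(A, b, t=3):
    """2-by-2 Gaussian elimination with partial pivoting in t-digit decimal arithmetic."""
    fl = Context(prec=t, rounding=ROUND_HALF_EVEN)
    A = [[Decimal(v) for v in row] for row in A]
    b = [Decimal(v) for v in b]
    if abs(A[1][0]) > abs(A[0][0]):              # partial pivoting on column 1
        A[0], A[1] = A[1], A[0]
        b[0], b[1] = b[1], b[0]
    m = fl.divide(A[1][0], A[0][0])              # the single multiplier
    u22 = fl.subtract(A[1][1], fl.multiply(m, A[0][1]))
    y2 = fl.subtract(b[1], fl.multiply(m, b[0]))
    x2 = fl.divide(y2, u22)
    x1 = fl.divide(fl.subtract(b[0], fl.multiply(A[0][1], x2)), A[0][0])
    return [x1, x2]


x_unscaled = solve2_pp([["10", "100000"], ["1", "1"]], ["100000", "2"])
x_scaled = solve2_pp([["0.0001", "1"], ["1", "1"]], ["1", "2"])
```

In the unscaled system the pivot search keeps the first row, and the huge entry 100,000 then swamps the second equation; after row scaling the search selects the second row instead.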


3.5.3 Iterative Improvement 


Suppose $Ax = b$ has been solved via the partial pivoting factorization $PA =
LU$ and that we wish to improve the accuracy of the computed solution $\hat{x}$.
If we execute

    r = b - A x̂
    Solve Ly = Pr.                                           (3.5.4)
    Solve Uz = y.
    x_new = x̂ + z

then in exact arithmetic $A x_{new} = A\hat{x} + Az = (b - r) + r = b$. Unfortunately,
the naive floating point execution of these formulae renders an $x_{new}$ that is
no more accurate than $\hat{x}$. This is to be expected since $\hat{r} = \mathrm{fl}(b - A\hat{x})$ has
few, if any, correct significant digits. (Recall Heuristic I.) Consequently,
$\hat{z} = \mathrm{fl}(A^{-1} r) \approx A^{-1} \cdot \text{noise} \approx \text{noise}$ is a very poor correction from the
standpoint of improving the accuracy of $\hat{x}$. However, Skeel (1980) has done
an error analysis that indicates when (3.5.4) gives an improved $x_{new}$ from
the standpoint of backwards error. In particular, if the quantity

\[ \tau = \left( \| \, |A| |A^{-1}| \, \|_\infty \right) \left( \max_i (|A||x|)_i \, / \, \min_i (|A||x|)_i \right) \]

is not too big, then (3.5.4) produces an $x_{new}$ such that $(A + E) x_{new} = b$
for very small $E$. Of course, if Gaussian elimination with partial pivoting
is used then the computed $\hat{x}$ already solves a nearby system. However,
this may not be the case for some of the pivot strategies that are used to
preserve sparsity. In this situation, the fixed precision iterative improvement
step (3.5.4) can be very worthwhile and cheap. See Arioli, Demmel, and
Duff (1988).

For (3.5.4) to produce a more accurate $x$, it is necessary to compute the
residual $b - A\hat{x}$ with extended precision floating point arithmetic. Typically,
this means that if t-digit arithmetic is used to compute $PA = LU$, $x$, $y$, and
$z$, then 2t-digit arithmetic is used to form $b - A\hat{x}$, i.e., double precision. The
process can be iterated. In particular, once we have computed $PA = LU$
and initialize $x = 0$, we repeat the following:

    r = b - Ax    (Double Precision)
    Solve Ly = Pr for y.                                     (3.5.5)
    Solve Uz = y for z.
    x = x + z


We refer to this process as mixed precision iterative improvement. The
original $A$ must be used in the double precision computation of $r$. The
basic result concerning the performance of (3.5.5) is summarized in the
following heuristic:


Heuristic III. If the machine precision $u$ and condition satisfy $u \approx 10^{-d}$
and $\kappa_\infty(A) \approx 10^q$, then after $k$ executions of (3.5.5), $x$ has approximately
$\min(d, k(d - q))$ correct digits.


Roughly speaking, if $u \kappa_\infty(A) \le 1$, then iterative improvement can ultimately
produce a solution that is correct to full (single) precision. Note
that the process is relatively cheap. Each improvement costs $O(n^2)$, to be
compared with the original $O(n^3)$ investment in the factorization $PA = LU$.
Of course, no improvement may result if $A$ is badly enough conditioned with
respect to the machine precision.

The primary drawback of mixed precision iterative improvement is that 
its implementation is somewhat machine-dependent. This discourages its 
use in software that is intended for wide distribution. The need for retaining 
an original copy of A is another aggravation associated with the method. 

On the other hand, mixed precision iterative improvement is usually
very easy to implement on a given machine that has provision for the accumulation
of inner products, i.e., provision for the double precision calculation
of inner products between the rows of $A$ and $x$. In a short mantissa
computing environment the presence of an iterative improvement routine
can significantly widen the class of solvable $Ax = b$ problems.
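
A minimal Python sketch of (3.5.5) is given below, applied to the 2-by-2 system of §3.5.1 with $\beta = 10$, $t = 3$. It is an illustration under stated assumptions: the `decimal` module stands in for the floating point system, every operation is rounded individually (round-half-even), the residual is accumulated with $2t$ digits and then rounded to $t$ digits before the triangular solves, and the names `lu_pp`, `lu_solve`, `residual`, and `improve` are invented for this sketch.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def lu_pp(A, ctx):
    """Packed LU factors of P*A computed in the arithmetic defined by ctx."""
    n = len(A)
    A = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        mu = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[mu] = A[mu], A[k]
        perm[k], perm[mu] = perm[mu], perm[k]
        for i in range(k + 1, n):
            A[i][k] = ctx.divide(A[i][k], A[k][k])
            for j in range(k + 1, n):
                A[i][j] = ctx.subtract(A[i][j], ctx.multiply(A[i][k], A[k][j]))
    return A, perm


def lu_solve(LU, perm, r, ctx):
    """Solve Ly = Pr and Uz = y in working precision."""
    n = len(LU)
    y = [ctx.plus(r[perm[i]]) for i in range(n)]   # P r, rounded to working precision
    for i in range(n):
        for j in range(i):
            y[i] = ctx.subtract(y[i], ctx.multiply(LU[i][j], y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = ctx.subtract(y[i], ctx.multiply(LU[i][j], y[j]))
        y[i] = ctx.divide(y[i], LU[i][i])
    return y


def residual(A, x, b, ctx):
    """r = b - A x, accumulated in the (higher) precision defined by ctx."""
    r = []
    for i in range(len(A)):
        s = Decimal(0)
        for j in range(len(x)):
            s = ctx.add(s, ctx.multiply(A[i][j], x[j]))
        r.append(ctx.subtract(b[i], s))
    return r


def improve(A, b, t=3, steps=3):
    """Return the iterates produced by `steps` executions of (3.5.5), starting at x = 0."""
    lo = Context(prec=t, rounding=ROUND_HALF_EVEN)
    hi = Context(prec=2 * t, rounding=ROUND_HALF_EVEN)
    LU, perm = lu_pp(A, lo)
    x = [Decimal(0)] * len(b)
    history = []
    for _ in range(steps):
        z = lu_solve(LU, perm, residual(A, x, b, hi), lo)
        x = [lo.add(xi, zi) for xi, zi in zip(x, z)]
        history.append(x)
    return history


A = [[Decimal(".986"), Decimal(".579")], [Decimal(".409"), Decimal(".237")]]
b = [Decimal(".235"), Decimal(".107")]
history = improve(A, b)
```

Note that the original `A`, not its factors, is used in `residual`, and that the factorization is computed once and reused for every correction.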


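Example 3.5.2 If (3.5.5) is applied to the system

\[ \begin{bmatrix} .986 & .579 \\ .409 & .237 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} .235 \\ .107 \end{bmatrix} \]

and $\beta = 10$ and $t = 3$, then iterative improvement produces the following sequence of
computed solutions:

\[ \hat{x} = \begin{bmatrix} 2.11 \\ -3.17 \end{bmatrix}, \ \begin{bmatrix} 1.99 \\ -2.99 \end{bmatrix}, \ \begin{bmatrix} 2.00 \\ -3.00 \end{bmatrix}, \ \ldots \]

The exact solution is $x = [2, -3]^T$.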


3.5.4 Condition Estimation 


Suppose that we have solved $Ax = b$ via $PA = LU$ and that we now wish
to ascertain the number of correct digits in the computed solution $\hat{x}$. It
follows from Heuristic II that in order to do this we need an estimate of the
condition $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$. Computing $\| A \|_\infty$ poses no problem
as we merely use the formula

\[ \| A \|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|. \]

The challenge is with respect to the factor $\| A^{-1} \|_\infty$. Conceivably, we
could estimate this quantity by $\| \hat{X} \|_\infty$, where $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_n]$ and $\hat{x}_i$
is the computed solution to $A x_i = e_i$. (See §3.4.9.) The trouble with this
approach is its expense: $\hat{\kappa}_\infty = \| A \|_\infty \| \hat{X} \|_\infty$ costs about three times as
much as $\hat{x}$.

The central problem of condition estimation is how to estimate the
condition number in $O(n^2)$ flops assuming the availability of PA = LU or
some other factorizations that are presented in subsequent chapters. An
approach described in Forsythe and Moler (SLE, p. 51) is based on iterative
improvement and the heuristic $u\kappa_\infty(A) \approx \| z \|_\infty / \| x \|_\infty$ where z is the first
correction of x in (3.5.5). While the resulting condition estimator is $O(n^2)$,
it suffers from the shortcoming of iterative improvement, namely, machine
dependency.

Cline, Moler, Stewart, and Wilkinson (1979) have proposed a very
successful approach to the condition estimation problem without this flaw. It
is based on exploitation of the implication

$$Ay = d \ \Longrightarrow \ \| A^{-1} \|_\infty \ge \| y \|_\infty / \| d \|_\infty .$$

The idea behind their estimator is to choose d so that the solution y is large
in norm and then set

$$\hat{\kappa}_\infty = \| A \|_\infty \| y \|_\infty / \| d \|_\infty .$$

The success of this method hinges on how close the ratio $\| y \|_\infty / \| d \|_\infty$ is
to its maximum value $\| A^{-1} \|_\infty$.

Consider the case when A = T is upper triangular. The relation between 
d and y is completely specified by the following column version of back 
substitution: 


    p(1:n) = 0
    for k = n:-1:1
        Choose d(k).
        y(k) = (d(k) - p(k))/T(k,k)                          (3.5.6)
        p(1:k-1) = p(1:k-1) + y(k)T(1:k-1,k)
    end


Normally, we use this algorithm to solve a given triangular system Ty = d.
Now, however, we are free to pick the right-hand side d subject to the
"constraint" that y is large relative to d.

One way to encourage growth in y is to choose d(k) from the set
{-1, +1} so as to maximize |y(k)|. If p(k) ≥ 0, then set d(k) = -1. If
p(k) < 0, then set d(k) = +1. In other words, (3.5.6) is invoked with d(k)
= -sign(p(k)). Since d is then a vector of the form $d(1:n) = (\pm 1, \ldots, \pm 1)^T$,
we obtain the estimator $\hat{\kappa}_\infty = \| T \|_\infty \| y \|_\infty$.
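
The sign rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's code: the function name `simple_condition_estimate` and the list-of-lists storage are assumptions, T is assumed to be a nonsingular upper triangular matrix, and d(k) = -1 is used when p(k) = 0.

```python
def simple_condition_estimate(T):
    """Sketch of (3.5.6) with d(k) = -sign(p(k)) for upper triangular T.

    Returns (estimate, y), where estimate = ||T||_inf * ||y||_inf is a
    lower bound on kappa_inf(T) because ||d||_inf = 1.
    """
    n = len(T)
    p = [0.0] * n
    y = [0.0] * n
    for k in range(n - 1, -1, -1):
        # Choose d(k) in {-1, +1} with sign opposite to p(k).
        d_k = -1.0 if p[k] >= 0.0 else 1.0
        y[k] = (d_k - p[k]) / T[k][k]
        # Column-oriented update of the running sums p(1:k-1).
        for i in range(k):
            p[i] += y[k] * T[i][k]
    norm_T = max(sum(abs(t) for t in row) for row in T)
    norm_y = max(abs(v) for v in y)
    return norm_T * norm_y, y
```

For T = [[1, -1], [0, 1]] the sketch returns the estimate 4, which happens to equal the true $\kappa_\infty(T)$ for this small matrix; in general the value is only a lower bound.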

A more reliable estimator results if $d(k) \in \{-1, +1\}$ is chosen so as
to encourage growth both in y(k) and the updated running sum given by
p(1:k-1) + T(1:k-1,k)y(k). In particular, at step k we compute

$$\begin{aligned}
y(k)^+ &= (1 - p(k))/T(k,k) \\
s(k)^+ &= |y(k)^+| + \| p(1:k-1) + T(1:k-1,k)\,y(k)^+ \|_1 \\
y(k)^- &= (-1 - p(k))/T(k,k) \\
s(k)^- &= |y(k)^-| + \| p(1:k-1) + T(1:k-1,k)\,y(k)^- \|_1
\end{aligned}$$

and set

$$y(k) = \begin{cases} y(k)^+ & \text{if } s(k)^+ \ge s(k)^- \\ y(k)^- & \text{if } s(k)^+ < s(k)^- . \end{cases}$$
This gives 


Algorithm 3.5.1 (Condition Estimator) Let $T \in \mathbb{R}^{n \times n}$ be a nonsingular
upper triangular matrix. This algorithm computes unit $\infty$-norm y
and a scalar $\kappa$ so $\| Ty \|_\infty \approx 1/\| T^{-1} \|_\infty$ and $\kappa \approx \kappa_\infty(T)$.

    p(1:n) = 0
    for k = n:-1:1
        y(k)^+ = (1 - p(k))/T(k,k)
        y(k)^- = (-1 - p(k))/T(k,k)
        p(k)^+ = p(1:k-1) + T(1:k-1,k)y(k)^+
        p(k)^- = p(1:k-1) + T(1:k-1,k)y(k)^-
        if |y(k)^+| + || p(k)^+ ||_1 ≥ |y(k)^-| + || p(k)^- ||_1
            y(k) = y(k)^+
            p(1:k-1) = p(k)^+
        else
            y(k) = y(k)^-
            p(1:k-1) = p(k)^-
        end
    end
    κ = || y ||_∞ || T ||_∞
    y = y/|| y ||_∞


The algorithm involves several times the work of ordinary back substitution. 
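
A direct transcription of Algorithm 3.5.1 into Python might look as follows. It is a sketch under stated assumptions: the name `condition_estimator` and the list-of-lists storage are hypothetical, T is assumed upper triangular and nonsingular, and ties between the two candidates are resolved in favor of the "+" choice.

```python
def condition_estimator(T):
    """Sketch of Algorithm 3.5.1 for a nonsingular upper triangular T.

    Returns (kappa, y) with ||y||_inf = 1 and kappa an estimate of
    kappa_inf(T).
    """
    n = len(T)
    p = [0.0] * n
    y = [0.0] * n
    for k in range(n - 1, -1, -1):
        y_plus = (1.0 - p[k]) / T[k][k]
        y_minus = (-1.0 - p[k]) / T[k][k]
        # Candidate running sums p(1:k-1) for each choice of d(k).
        p_plus = [p[i] + T[i][k] * y_plus for i in range(k)]
        p_minus = [p[i] + T[i][k] * y_minus for i in range(k)]
        s_plus = abs(y_plus) + sum(abs(v) for v in p_plus)
        s_minus = abs(y_minus) + sum(abs(v) for v in p_minus)
        if s_plus >= s_minus:
            y[k] = y_plus
            p[:k] = p_plus
        else:
            y[k] = y_minus
            p[:k] = p_minus
    norm_y = max(abs(v) for v in y)
    norm_T = max(sum(abs(t) for t in row) for row in T)
    kappa = norm_y * norm_T
    return kappa, [v / norm_y for v in y]
```

Each step forms two candidate update vectors, which is where the extra work relative to ordinary back substitution comes from.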

We are now in a position to describe a procedure for estimating the 
condition of a square nonsingular matrix A whose PA = LU factorization 
we know: 


• Apply the lower triangular version of Algorithm 3.5.1 to $U^T$ and obtain
  a large norm solution to $U^T y = d$.

• Solve the triangular systems $L^T r = y$, $Lw = Pr$, and $Uz = w$.

• $\hat{\kappa}_\infty = \| A \|_\infty \| z \|_\infty / \| r \|_\infty$.

Note that $\| z \|_\infty \le \| A^{-1} \|_\infty \| r \|_\infty$. The method is based on several
heuristics. First, if A is ill-conditioned and PA = LU, then it is usually the case
that U is correspondingly ill-conditioned. The lower triangle L tends to be
fairly well-conditioned. Thus, it is more profitable to apply the condition
estimator to U than to L. The vector r, because it solves $A^T P^T r = d$,
tends to be rich in the direction of the left singular vector associated with
$\sigma_{\min}(A)$. Right-hand sides with this property render large solutions to the
problem Az = r.

In practice, it is found that the condition estimation technique that we 
have outlined produces good order-of-magnitude estimates of the actual 
condition number. 


Problems 


P3.5.1 Show by example that there may be more than one way to equilibrate a matrix. 


P3.5.2 Using $\beta = 10$, $t = 2$ arithmetic, solve


Us FILE] [s] 


using Gaussian elimination with partial pivoting. Do one step of iterative improvement
using t = 4 arithmetic to compute the residual. (Do not forget to round the computed
residual to two digits.)

P3.5.3 Suppose $P(A + E) = \hat{L}\hat{U}$, where P is a permutation, $\hat{L}$ is lower triangular with
$|\hat{l}_{ij}| \le 1$, and $\hat{U}$ is upper triangular. Show that $\kappa_\infty(A) \ge \| A \|_\infty / (\| E \|_\infty + \mu)$ where
$\mu = \min |\hat{u}_{ii}|$. Conclude that if a small pivot is encountered when Gaussian elimination
with pivoting is applied to A, then A is ill-conditioned. The converse is not true. (Let
$A = B_n$.)

P3.5.4 (Kahan 1965) The system Ax = b where

$$A = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 10^{-10} & 10^{-10} \\ 1 & 10^{-10} & 10^{-10} \end{bmatrix}, \qquad b = \begin{bmatrix} 2(1 + 10^{-10}) \\ -10^{-10} \\ 10^{-10} \end{bmatrix}$$

has solution $x = (10^{-10}, \ -1, \ 1)^T$. (a) Show that if (A + E)y = b and $|E| \le 10^{-8}|A|$,
then $|x - y| \le 10^{-7}|x|$. That is, small relative changes in A's entries do not induce large
changes in x even though $\kappa_\infty(A) = 10^{10}$. (b) Define $D = \mathrm{diag}(10^{-5}, 10^{5}, 10^{5})$. Show
$\kappa_\infty(DAD) \le 5$. (c) Explain what is going on in terms of Theorem 2.7.3.

P3.5.5 Consider the matrix:

$$T = \begin{bmatrix} 1 & 0 & M & -M \\ 0 & 1 & -M & M \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad M \in \mathbb{R}.$$

What estimate of $\kappa_\infty(T)$ is produced when (3.5.6) is applied with d(k) = -sign(p(k))?
What estimate does Algorithm 3.5.1 produce? What is the true $\kappa_\infty(T)$?


P3.5.6 What does Algorithm 3.5.1 produce when applied to the matrix $B_n$ given in
(2.7.9)?
Chapter 4


Special Linear Systems


§4.1 The LDM^T and LDL^T Factorizations
§4.2 Positive Definite Systems
§4.3 Banded Systems
§4.4 Symmetric Indefinite Systems
§4.5 Block Systems
§4.6 Vandermonde Systems and the FFT
§4.7 Toeplitz and Related Systems


It is a basic tenet of numerical analysis that structure should be ex- 
ploited whenever solving a problem. In numerical linear algebra, this trans- 
lates into an expectation that algorithms for general matrix problems can 
be streamlined in the presence of such properties as symmetry, definiteness, 
and sparsity. This is the central theme of the current chapter, where our 
principal aim is to devise special algorithms for computing special variants 
of the LU factorization. 

We begin by pointing out the connection between the triangular
factors L and U when A is symmetric. This is achieved by examining the
$LDM^T$ factorization in §4.1. We then turn our attention to the important
case when A is both symmetric and positive definite, deriving the stable
Cholesky factorization in §4.2. Unsymmetric positive definite systems are
also investigated in this section. In §4.3, banded versions of Gaussian
elimination and other factorization methods are discussed. We then examine the
interesting situation when A is symmetric but indefinite. Our treatment of
this problem in §4.4 highlights the numerical analyst's ambivalence towards
pivoting. We love pivoting for the stability it induces but despise it for the
structure that it can destroy. Fortunately, there is a happy resolution to
this conflict in the symmetric indefinite problem.

Any block banded matrix is also banded and so the methods of §4.3 are
applicable. Yet, there are occasions when it pays not to adopt this point of
view. To illustrate this we consider the important case of block tridiagonal
systems in §4.5. Other block systems are discussed as well.

In the final two sections we examine some very interesting $O(n^2)$
algorithms that can be used to solve Vandermonde and Toeplitz systems.


Before You Begin 


Chapter 1, §§2.1-2.5, and §2.7, and Chapter 3 are assumed. Within this
chapter there are the following dependencies:


              §4.5
               ↑
§4.1 → §4.2 → §4.3 → §4.4
               ↓
              §4.6 → §4.7


Complementary references include George and Liu (1981), Gill, Murray, 
and Wright (1991), Higham (1996), Trefethen and Bau (1996), and Demmel 
(1996). Some MATLAB functions important to this chapter: chol, tril, 
triu, vander, toeplitz, fft. LAPACK connections include 


LAPACK: General Band Matrices
    Solve AX = B
    Condition estimator
    Improve AX = B, A^T X = B, A^H X = B solutions with error bounds
    Solve AX = B, A^T X = B, A^H X = B with condition estimate
    PA = LU
    Solve AX = B, A^T X = B, A^H X = B via PA = LU


LAPACK: General Tridiagonal Matrices
    Condition estimator
    Improve AX = B, A^T X = B, A^H X = B solutions with error bounds
    Solve AX = B, A^T X = B, A^H X = B with condition estimate
    PA = LU
    Solve AX = B, A^T X = B, A^H X = B via PA = LU


LAPACK: Full Symmetric Positive Definite
    Solve AX = B
    Condition estimate via PA = LU
    Improve AX = B solutions with error bounds
    Solve AX = B with condition estimate
    A = GG^T
    Solve AX = B via A = GG^T
    A^{-1}
    Equilibration


    Solve AX = B
    Condition estimate via A = GG^T
    Improve AX = B solutions with error bounds
    Solve AX = B with condition estimate
    A = GG^T
    Solve AX = B via A = GG^T


    Condition estimate via A = LDL^T
    Improve AX = B solutions with error bounds
    Solve AX = B with condition estimate
    A = LDL^T
    Solve AX = B via A = LDL^T


    Condition estimate via PAP^T = LDL^T
    Improve AX = B solutions with error bounds
    Solve AX = B with condition estimate
    PAP^T = LDL^T
    Solve AX = B via PAP^T = LDL^T
    A^{-1}


    Condition estimate
    Improve AX = B, A^T X = B solutions with error bounds
    Solve AX = B, A^T X = B


4.1 The LDM^T and LDL^T Factorizations


We want to develop a structure-exploiting method for solving symmetric
Ax = b problems. To do this we establish a variant of the LU factorization
in which A is factored into a three-matrix product $LDM^T$ where D is
diagonal and L and M are unit lower triangular. Once this factorization is
obtained, the solution to Ax = b may be found in $O(n^2)$ flops by solving
Ly = b (forward elimination), Dz = y, and $M^T x = z$ (back substitution).
The reason for developing the $LDM^T$ factorization is to set the stage for
the symmetric case for if $A = A^T$ then L = M and the work associated
with the factorization is half of that required by Gaussian elimination. The
issue of pivoting is taken up in subsequent sections.


4.1.1 The LDM^T Factorization

Our first result connects the $LDM^T$ factorization with the LU factorization.


Theorem 4.1.1 If all the leading principal submatrices of $A \in \mathbb{R}^{n \times n}$ are
nonsingular, then there exist unique unit lower triangular matrices L and M
and a unique diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ such that $A = LDM^T$.


Proof. By Theorem 3.2.1 we know that A has an LU factorization A = LU.
Set $D = \mathrm{diag}(d_1, \ldots, d_n)$ with $d_i = u_{ii}$ for i = 1:n. Notice that D is
nonsingular and that $M^T = D^{-1}U$ is unit upper triangular. Thus, $A = LU =
LD(D^{-1}U) = LDM^T$. Uniqueness follows from the uniqueness of the LU
factorization as described in Theorem 3.2.1. □


The proof shows that the $LDM^T$ factorization can be found by using Gaussian
elimination to compute A = LU and then determining D and M from
the equation $U = DM^T$. However, an interesting alternative algorithm can
be derived by computing L, D, and M directly.

Assume that we know the first j - 1 columns of L, diagonal entries
$d_1, \ldots, d_{j-1}$ of D, and the first j - 1 rows of M for some j with $1 \le j \le n$.
To develop recipes for L(j+1:n, j), M(j, 1:j-1), and $d_j$ we equate jth
columns in the equation $A = LDM^T$. In particular,

    A(1:n, j) = Lv                                           (4.1.1)

where $v = DM^T e_j$. The "top" half of (4.1.1) defines v(1:j) as the solution
of a known lower triangular system:

    L(1:j, 1:j)v(1:j) = A(1:j, j).

Once we know v then we compute

    d(j) = v(j)
    M(j,i) = v(i)/d(i)      i = 1:j-1

The "bottom" half of (4.1.1) says L(j+1:n, 1:j)v(1:j) = A(j+1:n, j) which
can be rearranged to obtain a recipe for the jth column of L:

    L(j+1:n, j)v(j) = A(j+1:n, j) - L(j+1:n, 1:j-1)v(1:j-1).

Thus, L(j+1:n, j) is a scaled gaxpy operation and overall we obtain

    for j = 1:n
        Solve L(1:j, 1:j)v(1:j) = A(1:j, j) for v(1:j).
        for i = 1:j-1
            M(j,i) = v(i)/d(i)
        end                                                  (4.1.2)
        d(j) = v(j)
        L(j+1:n, j) =
            (A(j+1:n, j) - L(j+1:n, 1:j-1)v(1:j-1))/v(j)
    end
As with the LU factorization, it is possible to overwrite A with the L, D,
and M factors. If the column version of forward elimination is used to solve
for v(1:j) then we obtain the following procedure:


Algorithm 4.1.1 (LDM^T) If $A \in \mathbb{R}^{n \times n}$ has an LU factorization then
this algorithm computes unit lower triangular matrices L and M and a
diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ such that $A = LDM^T$. The entry
$a_{ij}$ is overwritten with $l_{ij}$ if i > j, with $d_i$ if i = j, and with $m_{ji}$ if i < j.


    for j = 1:n
        { Solve L(1:j, 1:j)v(1:j) = A(1:j, j). }
        v(1:j) = A(1:j, j)
        for k = 1:j-1
            v(k+1:j) = v(k+1:j) - v(k)A(k+1:j, k)
        end
        { Compute M(j, 1:j-1) and store in A(1:j-1, j). }
        for i = 1:j-1
            A(i,j) = v(i)/A(i,i)
        end
        { Store d(j) in A(j,j). }
        A(j,j) = v(j)
        { Compute L(j+1:n, j) and store in A(j+1:n, j). }
        for k = 1:j-1
            A(j+1:n, j) = A(j+1:n, j) - v(k)A(j+1:n, k)
        end
        A(j+1:n, j) = A(j+1:n, j)/v(j)
    end


This algorithm involves the same amount of work as the LU factorization,
about $2n^3/3$ flops.
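
The recipe (4.1.2) can be sketched in Python as shown below. This is an illustrative sketch rather than a transcription of the overwriting Algorithm 4.1.1: the name `ldmt` and the choice to return separate factors L, d, M as lists are assumptions, and no pivoting is done, so all leading principal submatrices are assumed nonsingular.

```python
def ldmt(A):
    """Sketch of (4.1.2): return (L, d, M) with A = L * diag(d) * M^T."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Solve L(1:j,1:j) v(1:j) = A(1:j,j) by column-oriented
        # forward elimination.
        v = [float(A[i][j]) for i in range(j + 1)]
        for k in range(j):
            for i in range(k + 1, j + 1):
                v[i] -= v[k] * L[i][k]
        # Row j of M and the jth diagonal entry of D.
        for i in range(j):
            M[j][i] = v[i] / d[i]
        d[j] = v[j]
        # Column j of L from the "bottom" half of (4.1.1).
        for i in range(j + 1, n):
            s = float(A[i][j])
            for k in range(j):
                s -= L[i][k] * v[k]
            L[i][j] = s / v[j]
    return L, d, M
```

Applied to the 3-by-3 matrix with rows (10, 10, 20), (20, 25, 40), (30, 50, 61), the sketch gives d = (10, 5, 1), L with subdiagonal entries 2, 3, 4, and M with rows (1, 0, 0), (1, 1, 0), (2, 0, 1).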

The computed solution $\hat{x}$ to Ax = b obtained via Algorithm 4.1.1 and
the usual triangular system solvers of §3.1 can be shown to satisfy a
perturbed system $(A + E)\hat{x} = b$, where

$$|E| \le nu\left(3|A| + 5|\hat{L}||\hat{D}||\hat{M}^T|\right) + O(u^2) \qquad (4.1.3)$$

and $\hat{L}$, $\hat{D}$, and $\hat{M}$ are the computed versions of L, D, and M, respectively.

As in the case of the LU factorization considered in the previous chapter,
the upper bound in (4.1.3) is without limit unless some form of pivoting is
done. Hence, for Algorithm 4.1.1 to be a practical procedure, it must be
modified so as to compute a factorization of the form $PA = LDM^T$, where
P is a permutation matrix chosen so that the entries in L satisfy $|l_{ij}| \le 1$.
The details of this are not pursued here since they are straightforward and
since our main object for introducing the $LDM^T$ factorization is to motivate
special methods for symmetric systems.


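Example 4.1.1

$$A = \begin{bmatrix} 10 & 10 & 20 \\ 20 & 25 & 40 \\ 30 & 50 & 61 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 4 & 1 \end{bmatrix} \begin{bmatrix} 10 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and upon completion, Algorithm 4.1.1 overwrites A as follows:

$$A = \begin{bmatrix} 10 & 1 & 2 \\ 2 & 5 & 0 \\ 3 & 4 & 1 \end{bmatrix}.$$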


4.1.2 Symmetry and the LDL^T Factorization

There is redundancy in the $LDM^T$ factorization if A is symmetric.


Theorem 4.1.2 If $A = LDM^T$ is the $LDM^T$ factorization of a nonsingular
symmetric matrix A, then L = M.


Proof. The matrix $M^{-1}AM^{-T} = M^{-1}LD$ is both symmetric and lower
triangular and therefore diagonal. Since D is nonsingular, this implies
that $M^{-1}L$ is also diagonal. But $M^{-1}L$ is unit lower triangular and so
$M^{-1}L = I$. □


In view of this result, it is possible to halve the work in Algorithm 4.1.1
when it is applied to a symmetric matrix. In the jth step we already know
M(j, 1:j-1) since M = L and we presume knowledge of L's first j - 1
columns. Recall that in the jth step of (4.1.2) the vector v(1:j) is defined
by the first j components of $DM^T e_j$. Since M = L, this says that

$$v(1:j) = \begin{bmatrix} d(1)L(j,1) \\ \vdots \\ d(j-1)L(j,j-1) \\ d(j) \end{bmatrix}.$$

Hence, the vector v(1:j-1) can be obtained by a simple scaling of L's jth
row. The formula v(j) = A(j,j) - L(j, 1:j-1)v(1:j-1) can be derived
from the jth equation in L(1:j, 1:j)v = A(1:j, j) rendering

    for j = 1:n
        for i = 1:j-1
            v(i) = L(j,i)d(i)
        end
        v(j) = A(j,j) - L(j, 1:j-1)v(1:j-1)
        d(j) = v(j)
        L(j+1:n, j) =
            (A(j+1:n, j) - L(j+1:n, 1:j-1)v(1:j-1))/v(j)
    end
With overwriting this becomes


Algorithm 4.1.2 (LDL^T) If $A \in \mathbb{R}^{n \times n}$ is symmetric and has an LU
factorization then this algorithm computes a unit lower triangular matrix
L and a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ so $A = LDL^T$. The entry
$a_{ij}$ is overwritten with $l_{ij}$ if i > j and with $d_i$ if i = j.


    for j = 1:n
        { Compute v(1:j). }
        for i = 1:j-1
            v(i) = A(j,i)A(i,i)
        end
        v(j) = A(j,j) - A(j, 1:j-1)v(1:j-1)
        { Store d(j) and compute L(j+1:n, j). }
        A(j,j) = v(j)
        A(j+1:n, j) =
            (A(j+1:n, j) - A(j+1:n, 1:j-1)v(1:j-1))/v(j)
    end

This algorithm requires $n^3/3$ flops, about half the number of flops involved
in Gaussian elimination.
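
As a sketch of how the symmetric recipe looks in code, the Python function below follows the non-overwriting loop that precedes Algorithm 4.1.2. The name `ldlt` and the decision to return L and d separately are assumptions made for illustration; only the lower triangle of A is read and no pivoting is attempted.

```python
def ldlt(A):
    """Sketch of the LDL^T recipe: return (L, d) with A = L * diag(d) * L^T.

    A is assumed symmetric with nonsingular leading principal submatrices.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # v(1:j-1) is a scaling of row j of L by the known d(i).
        v = [L[j][i] * d[i] for i in range(j)]
        vj = float(A[j][j]) - sum(L[j][i] * v[i] for i in range(j))
        d[j] = vj
        # Column j of L below the diagonal.
        for i in range(j + 1, n):
            s = float(A[i][j]) - sum(L[i][k] * v[k] for k in range(j))
            L[i][j] = s / vj
    return L, d
```

For the symmetric matrix with rows (10, 20, 30), (20, 45, 80), (30, 80, 171) this yields d = (10, 5, 1) and subdiagonal entries 2, 3, 4 in L.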

In the next section, we show that if A is both symmetric and positive
definite, then Algorithm 4.1.2 not only runs to completion, but is extremely
stable. If A is symmetric but not positive definite, then pivoting may be
necessary and the methods of §4.4 are relevant.


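Example 4.1.2

$$A = \begin{bmatrix} 10 & 20 & 30 \\ 20 & 45 & 80 \\ 30 & 80 & 171 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 4 & 1 \end{bmatrix} \begin{bmatrix} 10 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 4 \\ 0 & 0 & 1 \end{bmatrix}$$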


Problems 


P4.1.1 Show that the $LDM^T$ factorization of a nonsingular A is unique if it exists.


P4.1.2 Modify Algorithm 4.1.1 so that it computes a factorization of the form $PA =
LDM^T$, where L and M are both unit lower triangular, D is diagonal, and P is a
permutation that is chosen so $|l_{ij}| \le 1$.

P4.1.3 Suppose the n-by-n symmetric matrix $A = (a_{ij})$ is stored in a vector c as
follows: $c = (a_{11}, a_{21}, \ldots, a_{n1}, a_{22}, \ldots, a_{n2}, \ldots, a_{nn})$. Rewrite Algorithm 4.1.2 with A
stored in this fashion. Get as much indexing outside the inner loops as possible.

P4.1.4 Rewrite Algorithm 4.1.2 for A stored by diagonal. See §1.2.8.


Notes and References for Sec. 4.1


Algorithm 4.1.1 is related to the methods of Crout and Doolittle in that outer product
updates are avoided. See Chapter 4 of Fox (1964) or Stewart (1973, 131-149). An Algol
procedure may be found in


H.J. Bowdler, R.S. Martin, G. Peters, and J.H. Wilkinson (1966). "Solution of Real and
Complex Systems of Linear Equations," Numer. Math. 8, 217-234.


See also


G.E. Forsythe (1960). "Crout with Pivoting," Comm. ACM 3, 507-08.
W.M. McKeeman (1962). "Crout with Equilibration and Iteration," Comm. ACM 5,
353-55.


Just as algorithms can be tailored to exploit structure, so can error analysis and
perturbation theory:


M. Arioli, J. Demmel, and I. Duff (1989). "Solving Sparse Linear Systems with Sparse
Backward Error," SIAM J. Matrix Anal. Appl. 10, 165-190.

J.R. Bunch, J.W. Demmel, and C.F. Van Loan (1989). "The Strong Stability of
Algorithms for Solving Symmetric Linear Systems," SIAM J. Matrix Anal. Appl. 10,
494-499.

A. Barrlund (1991). "Perturbation Bounds for the LDL^T and LU Decompositions,"
BIT 31, 358-363.

D.J. Higham and N.J. Higham (1992). "Backward Error and Condition of Structured
Linear Systems," SIAM J. Matrix Anal. Appl. 13, 162-175.


4.2 Positive Definite Systems

A matrix $A \in \mathbb{R}^{n \times n}$ is positive definite if $x^T A x > 0$ for all nonzero $x \in \mathbb{R}^n$.
Positive definite systems constitute one of the most important classes of
special Ax = b problems. Consider the 2-by-2 symmetric case. If

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$

is positive definite then

$$\begin{aligned}
x = (1, 0)^T \ &\Rightarrow \ x^T A x = a_{11} > 0 \\
x = (0, 1)^T \ &\Rightarrow \ x^T A x = a_{22} > 0 \\
x = (1, 1)^T \ &\Rightarrow \ x^T A x = a_{11} + 2a_{12} + a_{22} > 0 \\
x = (1, -1)^T \ &\Rightarrow \ x^T A x = a_{11} - 2a_{12} + a_{22} > 0 .
\end{aligned}$$

The last two equations imply $|a_{12}| \le (a_{11} + a_{22})/2$. From these results we
see that the largest entry in A is on the diagonal and that it is positive. This
turns out to be true in general. A symmetric positive definite matrix has
a "weighty" diagonal. The mass on the diagonal is not blatantly obvious
as in the case of diagonal dominance but it has the same effect in that it
precludes the need for pivoting. See §3.4.10.

We begin with a few comments about the property of positive definiteness
and what it implies in the unsymmetric case with respect to pivoting.
We then focus on the efficient organization of the Cholesky procedure which
can be used to safely factor a symmetric positive definite A. Gaxpy, outer
product, and block versions are developed. The section concludes with a
few comments about the semidefinite case.


4.2.1 Positive Definiteness 


Suppose $A \in \mathbb{R}^{n \times n}$ is positive definite. It is obvious that a positive definite
matrix is nonsingular for otherwise we could find a nonzero x so $x^T A x = 0$.
However, much more is implied by the positivity of the quadratic form
$x^T A x$ as the following results show.


Theorem 4.2.1 If $A \in \mathbb{R}^{n \times n}$ is positive definite and $X \in \mathbb{R}^{n \times k}$ has rank
k, then $B = X^T A X \in \mathbb{R}^{k \times k}$ is also positive definite.


Proof. If $z \in \mathbb{R}^k$ satisfies $0 \ge z^T B z = (Xz)^T A (Xz)$ then Xz = 0. But
since X has full column rank, this implies that z = 0. □


Corollary 4.2.2 If A is positive definite then all its principal submatrices
are positive definite. In particular, all the diagonal entries are positive.


Proof. If $v \in \mathbb{R}^k$ is an integer vector with $1 \le v_1 < \cdots < v_k \le n$, then
$X = I_n(:, v)$ is a rank k matrix made up of columns $v_1, \ldots, v_k$ of the identity.
It follows from Theorem 4.2.1 that $A(v, v) = X^T A X$ is positive definite. □


Corollary 4.2.3 If A is positive definite then the factorization $A = LDM^T$
exists and $D = \mathrm{diag}(d_1, \ldots, d_n)$ has positive diagonal entries.


Proof. From Corollary 4.2.2, it follows that the submatrices A(1:k, 1:k)
are nonsingular for k = 1:n and so from Theorem 4.1.1 the factorization
$A = LDM^T$ exists. If we apply Theorem 4.2.1 with $X = L^{-T}$ then $B =
DM^T L^{-T} = L^{-1} A L^{-T}$ is positive definite. Since $M^T L^{-T}$ is unit upper
triangular, B and D have the same diagonal and it must be positive. □

There are several typical situations that give rise to positive definite
matrices in practice:

• The quadratic form is an energy function whose positivity is guaranteed
  from physical principles.

• The matrix A equals a cross-product $X^T X$ where X has full column
  rank. (Positive definiteness follows by setting $A = I_n$ in Theorem
  4.2.1.)

• Both A and $A^T$ are diagonally dominant and each $a_{ii}$ is positive.
4.2.2 Unsymmetric Positive Definite Systems


The mere existence of an $LDM^T$ factorization does not mean that its
computation is advisable because the resulting factors may have unacceptably
large elements. For example, if $\epsilon > 0$ then the matrix

$$A = \begin{bmatrix} \epsilon & m \\ -m & \epsilon \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -m/\epsilon & 1 \end{bmatrix} \begin{bmatrix} \epsilon & 0 \\ 0 & \epsilon + m^2/\epsilon \end{bmatrix} \begin{bmatrix} 1 & m/\epsilon \\ 0 & 1 \end{bmatrix}$$

is positive definite. But if $m/\epsilon \gg 1$, then pivoting is recommended.
The following result suggests when to expect element growth in the
$LDM^T$ factorization of a positive definite matrix.


Theorem 4.2.4 Let $A \in \mathbb{R}^{n \times n}$ be positive definite and set $T = (A + A^T)/2$
and $S = (A - A^T)/2$. If $A = LDM^T$, then

$$\| \, |L||D||M^T| \, \|_F \le n\left( \| T \|_2 + \| S T^{-1} S \|_2 \right) \qquad (4.2.1)$$

Proof. See Golub and Van Loan (1979). □


The theorem suggests when it is safe not to pivot. Assume that the computed
factors $\hat{L}$, $\hat{D}$, and $\hat{M}$ satisfy:

$$\| \, |\hat{L}||\hat{D}||\hat{M}^T| \, \|_F \le c \, \| \, |L||D||M^T| \, \|_F , \qquad (4.2.2)$$

where c is a constant of modest size. It follows from (4.2.1) and the analysis
in §3.3 that if these factors are used to compute a solution to Ax = b, then
the computed solution $\hat{x}$ satisfies $(A + E)\hat{x} = b$ with

$$\| E \|_F \le u\left( 3n \| A \|_F + 5cn^2 \left( \| T \|_2 + \| S T^{-1} S \|_2 \right) \right) + O(u^2). \qquad (4.2.3)$$

It is easy to show that $\| T \|_2 \le \| A \|_2$ and so it follows that if

$$\Omega = \frac{\| S T^{-1} S \|_2}{\| A \|_2} \qquad (4.2.4)$$

is not too large then it is safe not to pivot. In other words, the norm of the
skew part S has to be modest relative to the condition of the symmetric
part T. Sometimes it is possible to estimate $\Omega$ in an application. This is
trivially the case when A is symmetric for then $\Omega = 0$.


4.2.3 Symmetric Positive Definite Systems


When we apply the above results to a symmetric positive definite system
we know that the factorization $A = LDL^T$ exists and moreover is stable to
compute. However, in this situation another factorization is available.

Theorem 4.2.5 (Cholesky Factorization) If $A \in \mathbb{R}^{n \times n}$ is symmetric
positive definite, then there exists a unique lower triangular $G \in \mathbb{R}^{n \times n}$ with
positive diagonal entries such that $A = GG^T$.


Proof. From Theorem 4.1.2, there exists a unit lower triangular L and a
diagonal $D = \mathrm{diag}(d_1, \ldots, d_n)$ such that $A = LDL^T$. Since the $d_k$ are
positive, the matrix $G = L \, \mathrm{diag}(\sqrt{d_1}, \ldots, \sqrt{d_n})$ is real lower triangular with
positive diagonal entries. It also satisfies $A = GG^T$. Uniqueness follows
from the uniqueness of the $LDL^T$ factorization. □


The factorization $A = GG^T$ is known as the Cholesky factorization and G
is referred to as the Cholesky triangle. Note that if we compute the Cholesky
factorization and solve the triangular systems Gy = b and $G^T x = y$, then
$b = Gy = G(G^T x) = (GG^T)x = Ax$.
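
The two triangular solves can be sketched as follows. The function name `solve_with_cholesky_factor` is hypothetical, and the sketch assumes the Cholesky triangle G is already available as a list of rows with a nonzero diagonal.

```python
def solve_with_cholesky_factor(G, b):
    """Sketch: solve A x = b given lower triangular G with A = G G^T."""
    n = len(G)
    # Forward substitution: G y = b.
    y = [0.0] * n
    for i in range(n):
        s = float(b[i]) - sum(G[i][k] * y[k] for k in range(i))
        y[i] = s / G[i][i]
    # Back substitution: G^T x = y, where (G^T)(i,k) = G(k,i).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(G[k][i] * x[k] for k in range(i + 1, n))
        x[i] = s / G[i][i]
    return x
```

With G = [[2, 0], [1, 2]], so that $GG^T$ has rows (4, 2) and (2, 5), and b = (8, 12), the sketch returns x = (1, 2).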

Our proof of the Cholesky factorization in Theorem 4.2.5 is constructive. 
However, more effective methods for computing the Cholesky triangle can 
be derived by manipulating the equation A = GGT. This can be done in 
several ways as we show in the next few subsections. 


Example 4.2.1 The matrix 
[2 3)-[4 2G sh 30-4 3 a] l4 7] 


is positive definite. 


4.2.4 Gaxpy Cholesky 


We first derive an implementation of Cholesky that is rich in the gaxpy
operation. If we compare jth columns in the equation $A = GG^T$ then we
obtain

$$A(:, j) = \sum_{k=1}^{j} G(j,k) G(:,k).$$

This says that

$$G(j,j) G(:,j) = A(:,j) - \sum_{k=1}^{j-1} G(j,k) G(:,k) \equiv v. \qquad (4.2.5)$$

If we know the first j - 1 columns of G, then v is computable. It follows
by equating components in (4.2.5) that

$$G(j:n, j) = v(j:n)/\sqrt{v(j)}.$$

This is a scaled gaxpy operation and so we obtain the following gaxpy-based
method for computing the Cholesky factorization:

    for j = 1:n
        v(j:n) = A(j:n, j)
        for k = 1:j-1
            v(j:n) = v(j:n) - G(j,k)G(j:n, k)
        end
        G(j:n, j) = v(j:n)/sqrt(v(j))
    end


It is possible to arrange the computations so that G overwrites the lower
triangle of A.


Algorithm 4.2.1 (Cholesky: Gaxpy Version) Given a symmetric
positive definite $A \in \mathbb{R}^{n \times n}$, the following algorithm computes a lower
triangular $G \in \mathbb{R}^{n \times n}$ such that $A = GG^T$. For all $i \ge j$, G(i,j) overwrites
A(i,j).

    for j = 1:n
        if j > 1
            A(j:n, j) = A(j:n, j) - A(j:n, 1:j-1)A(j, 1:j-1)^T
        end
        A(j:n, j) = A(j:n, j)/sqrt(A(j,j))
    end


This algorithm requires $n^3/3$ flops.
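
A Python sketch of the gaxpy-based loop is given below. It follows the non-overwriting version that builds G column by column; the name `cholesky_gaxpy` and the explicit check for a nonpositive pivot are additions made for illustration and are not part of the algorithm as stated.

```python
import math


def cholesky_gaxpy(A):
    """Sketch of gaxpy Cholesky: return lower triangular G with A = G G^T.

    A is assumed symmetric positive definite; only its lower triangle
    is read.
    """
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # v holds v(j:n); v[i - j] corresponds to row i.
        v = [float(A[i][j]) for i in range(j, n)]
        for k in range(j):
            for i in range(j, n):
                v[i - j] -= G[j][k] * G[i][k]
        if v[0] <= 0.0:
            raise ValueError("nonpositive pivot: A is not positive definite")
        root = math.sqrt(v[0])
        for i in range(j, n):
            G[i][j] = v[i - j] / root
    return G
```

For the matrix with rows (4, 2, 2), (2, 5, 3), (2, 3, 6) the computed triangle has rows (2, 0, 0), (1, 2, 0), (1, 1, 2).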


4.2.5 Outer Product Cholesky 


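An alternative Cholesky procedure based on outer product (rank-1) updates
can be derived from the partitioning

$$A = \begin{bmatrix} \alpha & v^T \\ v & B \end{bmatrix} = \begin{bmatrix} \beta & 0 \\ v/\beta & I_{n-1} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & B - vv^T/\alpha \end{bmatrix} \begin{bmatrix} \beta & v^T/\beta \\ 0 & I_{n-1} \end{bmatrix}. \qquad (4.2.6)$$

Here, $\beta = \sqrt{\alpha}$ and we know that $\alpha > 0$ because A is positive definite. Note
that $B - vv^T/\alpha$ is positive definite because it is a principal submatrix of
$X^T A X$ where

$$X = \begin{bmatrix} 1 & -v^T/\alpha \\ 0 & I_{n-1} \end{bmatrix}.$$

If we have the Cholesky factorization $G_1 G_1^T = B - vv^T/\alpha$, then from (4.2.6)
it follows that $A = GG^T$ with

$$G = \begin{bmatrix} \beta & 0 \\ v/\beta & G_1 \end{bmatrix}.$$

Thus, the Cholesky factorization can be obtained through the repeated
application of (4.2.6), much in the style of kji Gaussian elimination.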




Algorithm 4.2.2 (Cholesky: Outer Product Version) Given a symmetric
positive definite $A \in \mathbb{R}^{n \times n}$, the following algorithm computes a lower
triangular $G \in \mathbb{R}^{n \times n}$ such that $A = GG^T$. For all $i \ge j$, G(i,j) overwrites
A(i,j).


    for k = 1:n
        A(k,k) = sqrt(A(k,k))
        A(k+1:n, k) = A(k+1:n, k)/A(k,k)
        for j = k+1:n
            A(j:n, j) = A(j:n, j) - A(j:n, k)A(j,k)
        end
    end


This algorithm involves $n^3/3$ flops. Note that the j-loop computes the lower
triangular part of the outer product update

    A(k+1:n, k+1:n) = A(k+1:n, k+1:n) - A(k+1:n, k)A(k+1:n, k)^T.

Recalling our discussion in §1.4.8 about gaxpy versus outer product
updates, it is easy to show that Algorithm 4.2.1 involves fewer vector touches
than Algorithm 4.2.2 by a factor of two.
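
The outer product organization can be sketched in Python in much the same way. The name `cholesky_outer` is hypothetical; the sketch works on a copy of A instead of overwriting the input, and the pivot check is an added safeguard.

```python
import math


def cholesky_outer(A):
    """Sketch of outer product Cholesky: return G with A = G G^T.

    A is assumed symmetric positive definite; the lower triangle of a
    working copy is overwritten column by column.
    """
    n = len(A)
    W = [[float(x) for x in row] for row in A]
    for k in range(n):
        if W[k][k] <= 0.0:
            raise ValueError("nonpositive pivot: A is not positive definite")
        W[k][k] = math.sqrt(W[k][k])
        for i in range(k + 1, n):
            W[i][k] /= W[k][k]
        # Lower triangular part of the rank-1 update of the trailing block.
        for j in range(k + 1, n):
            for i in range(j, n):
                W[i][j] -= W[i][k] * W[j][k]
    return [[W[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

On the matrix with rows (4, 2, 2), (2, 5, 3), (2, 3, 6) it produces the same triangle as the gaxpy organization, with rows (2, 0, 0), (1, 2, 0), (1, 1, 2); the difference lies only in the order in which the trailing entries are touched.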


4.2.6 Block Dot Product Cholesky 


Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite. Regard $A = (A_{ij})$ and its
Cholesky factor $G = (G_{ij})$ as N-by-N block matrices with square diagonal
blocks. By equating (i,j) blocks in the equation $A = GG^T$ with $i \ge j$ it
follows that

$$A_{ij} = \sum_{k=1}^{j} G_{ik} G_{jk}^T .$$

Defining

$$S = A_{ij} - \sum_{k=1}^{j-1} G_{ik} G_{jk}^T$$

we see that $G_{jj} G_{jj}^T = S$ if i = j and that $G_{ij} G_{jj}^T = S$ if i > j. Properly
sequenced, these equations can be arranged to compute all the $G_{ij}$:


Algorithm 4.2.3 (Cholesky: Block Dot Product Version) Given a
symmetric positive definite $A \in \mathbb{R}^{n \times n}$, the following algorithm computes a
lower triangular $G \in \mathbb{R}^{n \times n}$ such that $A = GG^T$. The lower triangular part
of A is overwritten by the lower triangular part of G. A is regarded as an
N-by-N block matrix with square diagonal blocks.

    for j = 1:N
        for i = j:N
            S = A_ij - sum_{k=1}^{j-1} G_ik G_jk^T
            if i = j
                Compute Cholesky factorization S = G_jj G_jj^T
            else
                Solve G_ij G_jj^T = S for G_ij
            end
            Overwrite A_ij with G_ij.
        end
    end

The overall process involves $n^3/3$ flops like the other Cholesky procedures
that we have developed. The procedure is rich in matrix multiplication
assuming a suitable blocking of the matrix A. For example, if n = rN and
each $A_{ij}$ is r-by-r, then the level-3 fraction is approximately $1 - (1/N^2)$.

Algorithm 4.2.3 is incomplete in the sense that we have not specified how
the products $G_{ik} G_{jk}^T$ are formed or how the r-by-r Cholesky factorizations
$S = G_{jj} G_{jj}^T$ are computed. These important details would have to be
worked out carefully in order to extract high performance.

Another block procedure can be derived from the gaxpy Cholesky
algorithm. After r steps of Algorithm 4.2.1 we know the matrices $G_{11} \in \mathbb{R}^{r \times r}$
and $G_{21} \in \mathbb{R}^{(n-r) \times r}$ in

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} G_{11} & 0 \\ G_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} I_r & 0 \\ 0 & \tilde{A} \end{bmatrix} \begin{bmatrix} G_{11} & 0 \\ G_{21} & I_{n-r} \end{bmatrix}^T .$$

We then perform r more steps of gaxpy Cholesky not on A but on the
reduced matrix $\tilde{A} = A_{22} - G_{21} G_{21}^T$ which we explicitly form exploiting
symmetry. Continuing in this way we obtain a block Cholesky algorithm
whose kth step involves r gaxpy Cholesky steps on a matrix of order
n - (k-1)r followed by a level-3 computation having order n - kr. The level-3
fraction is approximately equal to 1 - 3/(2N) if $n \approx rN$.


4.2.7 Stability of the Cholesky Process 


In exact arithmetic, we know that a symmetric positive definite matrix
has a Cholesky factorization. Conversely, if the Cholesky process runs to
completion with strictly positive square roots, then A is positive definite.
Thus, to find out if a matrix A is positive definite, we merely try to compute
its Cholesky factorization using any of the methods given above.
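
As a small sketch of this test, the Python function below attempts an outer product Cholesky factorization and reports whether every pivot was strictly positive. The name `is_positive_definite` is hypothetical, the input is assumed symmetric, and in floating point the answer for nearly singular matrices is subject to roundoff.

```python
import math


def is_positive_definite(A):
    """Sketch: return True if the Cholesky process runs to completion
    with strictly positive pivots on the symmetric matrix A."""
    n = len(A)
    W = [[float(x) for x in row] for row in A]
    for k in range(n):
        if W[k][k] <= 0.0:
            return False
        W[k][k] = math.sqrt(W[k][k])
        for i in range(k + 1, n):
            W[i][k] /= W[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                W[i][j] -= W[i][k] * W[j][k]
    return True
```

For instance, the matrix with rows (4, 2) and (2, 5) passes the test, while the indefinite matrix with rows (1, 2) and (2, 1) fails at the second pivot.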

The situation in the context of roundoff error is more interesting. The
numerical stability of the Cholesky algorithm roughly follows from the
inequality

$$g_{ij}^2 \le \sum_{k=1}^{i} g_{ik}^2 = a_{ii} .$$

This shows that the entries in the Cholesky triangle are nicely bounded.
The same conclusion can be reached from the equation $\| G \|_2^2 = \| A \|_2$.

The roundoff errors associated with the Cholesky factorization have
been extensively studied in a classical paper by Wilkinson (1968). Using
the results in this paper, it can be shown that if $\hat{x}$ is the computed solution
to Ax = b, obtained via any of our Cholesky procedures then $\hat{x}$ solves
the perturbed system $(A + E)\hat{x} = b$ where $\| E \|_2 \le c_n u \| A \|_2$ and $c_n$
is a small constant depending upon n. Moreover, Wilkinson shows that if
$q_n u \kappa_2(A) \le 1$ where $q_n$ is another small constant, then the Cholesky process
runs to completion, i.e., no square roots of negative numbers arise.


Example 4.2.2 If Algorithm 4.2.2 is applied to the positive definite matrix

$$A = \begin{bmatrix} 100 & 15 & .01 \\ 15 & 2.3 & .01 \\ .01 & .01 & 1.00 \end{bmatrix}$$

and $\beta = 10$, $t = 2$, rounded arithmetic used, then $\hat{g}_{11} = 10$, $\hat{g}_{21} = 1.5$, $\hat{g}_{31} = .001$ and
$\hat{g}_{22} = 0.00$. The algorithm then breaks down trying to compute $g_{32}$.


4.2.8 The Semidefinite Case 


A matrix is said to be positive semidefinite if $x^T A x \ge 0$ for all vectors
x. Symmetric positive semidefinite (sps) matrices are important and we
briefly discuss some Cholesky-like manipulations that can be used to solve
various sps problems. Results about the diagonal entries in an sps matrix
are needed first.


Theorem 4.2.6 If $A \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite, then

$$|a_{ij}| \le (a_{ii} + a_{jj})/2 \qquad (4.2.7)$$

$$|a_{ij}| \le \sqrt{a_{ii} a_{jj}} \qquad (4.2.8)$$

$$\max_{i,j} |a_{ij}| = \max_{i} a_{ii} \qquad (4.2.9)$$

$$a_{ii} = 0 \ \Rightarrow \ A(i,:) = 0, \ A(:,i) = 0 \qquad (4.2.10)$$


Proof. If $x = e_i + e_j$ then $0 \le x^T A x = a_{ii} + a_{jj} + 2a_{ij}$ while $x = e_i - e_j$
implies $0 \le x^T A x = a_{ii} + a_{jj} - 2a_{ij}$. Inequality (4.2.7) follows from these
two results. Equation (4.2.9) is an easy consequence of (4.2.7).

To prove (4.2.8) assume without loss of generality that i = 1 and j = 2
and consider the inequality

$$0 \le \begin{bmatrix} x \\ 1 \end{bmatrix}^T \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = a_{11} x^2 + 2a_{12} x + a_{22}$$

which holds since A(1:2, 1:2) is also semidefinite. This is a quadratic
in x and for the inequality to hold for all x, the discriminant $4a_{12}^2 - 4a_{11}a_{22}$
must be nonpositive. Implication (4.2.10) follows from (4.2.8). □


Consider what happens when outer product Cholesky is applied to an sps 
matrix. If a zero A(k,k) is encountered then from (4.2.10) A(k:n,k) is zero
and there is "nothing to do" and we obtain

    for k = 1:n
        if A(k,k) > 0
            A(k,k) = sqrt(A(k,k))
            A(k+1:n,k) = A(k+1:n,k)/A(k,k)
            for j = k+1:n
                A(j:n,j) = A(j:n,j) - A(j:n,k)A(j,k)        (4.2.11)
            end
        end
    end


Thus, a simple change makes Algorithm 4.2.2 applicable to the semidefinite
case. However, in practice rounding errors preclude the generation of exact
zeros and it may be preferable to incorporate pivoting.


4.2.9 Symmetric Pivoting


To preserve symmetry in a symmetric A we only consider data reorderings
of the form $PAP^T$ where P is a permutation. Row permutations ($A \leftarrow PA$)
or column permutations ($A \leftarrow AP$) alone destroy symmetry. An update of
the form
\[ A \leftarrow PAP^T \]
is called a symmetric permutation of A. Note that such an operation does
not move off-diagonal elements to the diagonal. The diagonal of $PAP^T$ is
a reordering of the diagonal of A.

Suppose at the beginning of the kth step in (4.2.11) we symmetrically
permute the largest diagonal entry of A(k:n,k:n) into the lead position.
If that largest diagonal entry is zero then A(k:n,k:n) = 0 by virtue of
(4.2.10). In this way we can compute the factorization $PAP^T = GG^T$
where $G \in \mathbb{R}^{n\times(k-1)}$ is lower triangular.


Algorithm 4.2.4 Suppose $A \in \mathbb{R}^{n\times n}$ is symmetric positive semidefinite
and that rank(A) = r. The following algorithm computes a permutation P,
the index r, and an n-by-r lower triangular matrix G such that
$PAP^T = GG^T$. The lower triangular part of A(:,1:r) is overwritten by the
lower triangular part of G. $P = P_r \cdots P_1$ where $P_k$ is the identity with
rows k and piv(k) interchanged.

    r = 0
    for k = 1:n
        Find q (k ≤ q ≤ n) so A(q,q) = max{A(k,k), ..., A(n,n)}
        if A(q,q) > 0
            r = r + 1
            piv(k) = q
            A(k,:) ↔ A(q,:)
            A(:,k) ↔ A(:,q)
            A(k,k) = sqrt(A(k,k))
            A(k+1:n,k) = A(k+1:n,k)/A(k,k)
            for j = k+1:n
                A(j:n,j) = A(j:n,j) - A(j:n,k)A(j,k)
            end
        end
    end


In practice, a tolerance is used to detect small A(k,k). However, the
situation is quite tricky and the reader should consult Higham (1989). In
addition, §5.5 has a discussion of tolerances in the rank detection problem.
Finally, we remark that a truly efficient implementation of Algorithm 4.2.4 
would only access the lower triangular portion of A. 
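
As an illustration, the following Python sketch mirrors Algorithm 4.2.4 on a small dense matrix held as a list of lists. The function name `pivoted_cholesky`, the `tol` parameter, and the returned index list `perm` are assumptions made for this sketch and do not appear in the text. Unlike the algorithm above, the sketch updates the whole trailing block rather than its lower triangle, so that the naive row and column interchanges always see current data.

```python
import math

def pivoted_cholesky(A, tol=0.0):
    """Cholesky with symmetric pivoting for a symmetric positive semidefinite A.

    A (list of lists) is copied, not modified. Returns (G, perm, r) with G an
    n-by-r lower trapezoidal matrix such that A[perm[i]][perm[j]] is
    (approximately) sum_k G[i][k] * G[j][k].
    """
    n = len(A)
    W = [row[:] for row in A]          # working copy
    perm = list(range(n))
    r = 0
    for k in range(n):
        # index of the largest remaining diagonal entry
        q = max(range(k, n), key=lambda i: W[i][i])
        if W[q][q] <= tol:
            break                      # trailing block is (numerically) zero
        r += 1
        # symmetric interchange of rows and columns k and q
        W[k], W[q] = W[q], W[k]
        for row in W:
            row[k], row[q] = row[q], row[k]
        perm[k], perm[q] = perm[q], perm[k]
        W[k][k] = math.sqrt(W[k][k])
        for i in range(k + 1, n):
            W[i][k] /= W[k][k]
        # update the full trailing block so that it stays symmetric
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                W[i][j] -= W[i][k] * W[j][k]
    G = [[W[i][j] if j <= i else 0.0 for j in range(r)] for i in range(n)]
    return G, perm, r
```

With the default `tol = 0.0` the loop stops only on an exactly zero or negative pivot, which rounding errors will usually prevent; choosing a sensible tolerance is the delicate issue referred to above.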


4.2.10 The Polar Decomposition and Square Root

Let $A = U_1\Sigma_1V^T$ be the thin SVD of $A \in \mathbb{R}^{m\times n}$ where $m \ge n$. Note that
\[ A = (U_1V^T)(V\Sigma_1V^T) \equiv ZP \tag{4.2.12} \]
where $Z = U_1V^T$ and $P = V\Sigma_1V^T$. Z has orthonormal columns and P is
symmetric positive semidefinite because
\[ x^TPx = (V^Tx)^T\Sigma_1(V^Tx) = \sum_{k=1}^{n} \sigma_k y_k^2 \ge 0 \]
where $y = V^Tx$. The decomposition (4.2.12) is called the polar decomposition
because it is analogous to the complex number factorization
$z = e^{i\arg(z)}|z|$. See §12.4.1 for further discussion.

Another important decomposition is the matrix square root. Suppose
$A \in \mathbb{R}^{n\times n}$ is symmetric positive semidefinite and that $A = GG^T$ is its
Cholesky factorization. If $G = U\Sigma V^T$ is G's SVD and $X = U\Sigma U^T$, then
X is symmetric positive semidefinite and
\[ A = GG^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma^2U^T = (U\Sigma U^T)(U\Sigma U^T) = X^2. \]

Thus, X is a square root of A. It can be shown (most easily with eigenvalue
theory) that a symmetric positive semidefinite matrix has a unique
symmetric positive semidefinite square root.


Problems


P4.2.1 Suppose that $H = A + iB$ is Hermitian and positive definite with $A, B \in \mathbb{R}^{n\times n}$.
This means that $x^HHx > 0$ whenever $x \ne 0$. (a) Show that
\[ C = \begin{bmatrix} A & -B \\ B & A \end{bmatrix} \]
is symmetric and positive definite. (b) Formulate an algorithm for solving
$(A + iB)(x + iy) = (b + ic)$, where b, c, x, and y are in $\mathbb{R}^n$. It should involve
$8n^3/3$ flops. How much storage is required?

P4.2.2 Suppose $A \in \mathbb{R}^{n\times n}$ is symmetric and positive definite. Give an algorithm for
computing an upper triangular matrix $R \in \mathbb{R}^{n\times n}$ such that $A = RR^T$.

P4.2.3 Let $A \in \mathbb{R}^{n\times n}$ be positive definite and set $T = (A + A^T)/2$ and $S = (A - A^T)/2$.
(a) Show that $\|A^{-1}\|_2 \le \|T^{-1}\|_2$ and $x^TA^{-1}x \le x^TT^{-1}x$ for all $x \in \mathbb{R}^n$. (b) Show
that if $A = LDM^T$, then $d_k \ge 1/\|T^{-1}\|_2$ for k = 1:n.

P4.2.4 Find a 2-by-2 real matrix A with the property that $x^TAx > 0$ for all real nonzero
2-vectors but which is not positive definite when regarded as a member of $\mathbb{C}^{2\times 2}$.

P4.2.5 Suppose $A \in \mathbb{R}^{n\times n}$ has a positive diagonal. Show that if both A and $A^T$ are
strictly diagonally dominant, then A is positive definite.

P4.2.6 Show that the function $f(x) = (x^TAx)^{1/2}$ is a vector norm on $\mathbb{R}^n$ if and only if
A is positive definite.

P4.2.7 Modify Algorithm 4.2.1 so that if the square root of a negative number is
encountered, then the algorithm finds a unit vector x so $x^TAx < 0$ and terminates.


P4.2.8 The numerical range W(A) of a complex matrix A is defined to be the set
$W(A) = \{x^HAx : x^Hx = 1\}$. Show that if $0 \notin W(A)$, then A has an LU factorization.

P4.2.9 Formulate an m < n version of the polar decomposition for $A \in \mathbb{R}^{m\times n}$.

P4.2.10 Suppose $A = I + uu^T$ where $A \in \mathbb{R}^{n\times n}$ and $\|u\|_2 = 1$. Give explicit formulae
for the diagonal and subdiagonal of A's Cholesky factor.

P4.2.11 Suppose $A \in \mathbb{R}^{n\times n}$ is symmetric positive definite and that its Cholesky factor
is available. Let $e_k = I_n(:,k)$. For $1 \le i < j \le n$, let $\alpha_{ij}$ be the smallest real that makes
$A + \alpha(e_ie_j^T + e_je_i^T)$ singular. Likewise, let $\alpha_{ii}$ be the smallest real that makes
$(A + \alpha e_ie_i^T)$ singular. Show how to compute these quantities using the
Sherman-Morrison-Woodbury formula. How many flops are required to find all the $\alpha_{ij}$?


Notes and References for Sec. 4.2 


The definiteness of the quadratic form $x^TAx$ can frequently be established by considering
the mathematics of the underlying problem. For example, the discretization of certain
partial differential operators gives rise to provably positive definite matrices. Aspects of
the unsymmetric positive definite problem are discussed in


A. Buckley (1974). “A Note on Matrices A = I + H, H Skew-Symmetric,” Z. Angew.
Math. Mech. 54, 125-26.

A. Buckley (1977). “On the Solution of Certain Skew-Symmetric Linear Systems,” SIAM
J. Num. Anal. 14, 568-70.

G.H. Golub and C. Van Loan (1979). “Unsymmetric Positive Definite Linear Systems,”
Lin. Alg. and Its Applic. 28, 85-98.

R. Mathias (1992). “Matrices with Positive Definite Hermitian Part: Inequalities and
Linear Systems,” SIAM J. Matrix Anal. Appl. 13, 640-654.


Symmetric positive definite systems constitute the most important class of special Ax = b
problems. Algol programs for these problems are given in


R.S. Martin, G. Peters, and J.H. Wilkinson (1965). “Symmetric Decomposition of a 
Positive Definite Matrix,” Numer. Math. 7, 362-83. 

R.S. Martin, G. Peters, and J.H. Wilkinson (1966). “Iterative Refinement of the Solution 
of a Positive Definite System of Equations,” Numer. Math. 8, 203-16. 

F.L. Bauer and C. Reinsch (1971). "Inversion of Positive Definite Matrices by the Gauss- 
Jordan Method,” in Handbook for Automatic Computation Vol 2, Linear Algebra, 
J.H. Wilkinson and C. Reinsch, eds. Springer-Verlag, New York, 45-49. 


The roundoff errors associated with the method are analyzed in 


J.H. Wilkinson (1968). “A Priori Error Analysis of Algebraic Processes,” Proc.
International Congress Math. (Moscow: Izdat. Mir, 1968), pp. 629-39.

J. Meinguet (1983). “Refined Error Analyses of Cholesky Factorization,” SIAM J.
Numer. Anal. 20, 1243-1250.

A. Kielbasinski (1987). “A Note on Rounding Error Analysis of Cholesky Factorization,”
Lin. Alg. and Its Applic. 88/89, 481-494.

N.J. Higham (1990). “Analysis of the Cholesky Decomposition of a Semidefinite Matrix,”
in Reliable Numerical Computation, M.G. Cox and S.J. Hammarling (eds), Oxford
University Press, Oxford, UK, 151-185.

R. Carter (1991). “Y-MP Floating Point and Cholesky Factorization,” Int'l J. High
Speed Computing 3, 215-222.

Ji-Guang Sun (1992). “Rounding Error and Perturbation Bounds for the Cholesky and
LDL^T Factorizations,” Lin. Alg. and Its Applic. 173, 77-91.


The question of how the Cholesky triangle G changes when $A = GG^T$ is perturbed is
analyzed in


G.W. Stewart (1977b). “Perturbation Bounds for the QR Factorization of a Matrix,”
SIAM J. Num. Anal. 14, 509-18.

Z. Drmač, M. Omladič, and K. Veselić (1994). “On the Perturbation of the Cholesky
Factorization,” SIAM J. Matrix Anal. Appl. 15, 1319-1332.


Nearness/sensitivity issues associated with positive semidefiniteness and the polar
decomposition are presented in


N.J. Higham (1988). “Computing a Nearest Symmetric Positive Semidefinite Matrix,”
Lin. Alg. and Its Applic. 103, 103-118.

R. Mathias (1993). “Perturbation Bounds for the Polar Decomposition,” SIAM J. Matrix
Anal. Appl. 14, 588-597.

R-C. Li (1995). “New Perturbation Bounds for the Unitary Polar Factor,” SIAM J.
Matrix Anal. Appl. 16, 327-332.


Computationally-oriented references for the polar decomposition and the square root are
given in §8.6 and §11.2 respectively.


4.3 Banded Systems


In many applications that involve linear systems, the matrix of coefficients
is banded. This is the case whenever the equations can be ordered so that
each unknown $x_i$ appears in only a few equations in a "neighborhood" of
the ith equation. Formally, we say that $A = (a_{ij})$ has upper bandwidth q
if $a_{ij} = 0$ whenever $j > i + q$ and lower bandwidth p if $a_{ij} = 0$ whenever
$i > j + p$. Substantial economies can be realized when solving banded
systems because the triangular factors in LU, $GG^T$, $LDM^T$, etc., are also
banded.

Before proceeding the reader is advised to review §1.2 where several 
aspects of band matrix manipulation are discussed. 


4.3.1 Band LU Factorization 


Our first result shows that if A is banded and A = LU then L(U) inherits 
the lower (upper) bandwidth of A. 


Theorem 4.3.1 Suppose $A \in \mathbb{R}^{n\times n}$ has an LU factorization A = LU. If A
has upper bandwidth q and lower bandwidth p, then U has upper bandwidth
q and L has lower bandwidth p.


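Proof. The proof is by induction on n. From (3.2.6) we have the
factorization
\[ A = \begin{bmatrix} \alpha & w^T \\ v & B \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ v/\alpha & I_{n-1} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & B - vw^T/\alpha \end{bmatrix} \begin{bmatrix} \alpha & w^T \\ 0 & I_{n-1} \end{bmatrix}. \]
It is clear that $B - vw^T/\alpha$ has upper bandwidth q and lower bandwidth p
because only the first q components of w and the first p components of v
are nonzero. Let $L_1U_1$ be the LU factorization of this matrix. Using the
induction hypothesis and the sparsity of w and v, it follows that
\[ L = \begin{bmatrix} 1 & 0 \\ v/\alpha & L_1 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} \alpha & w^T \\ 0 & U_1 \end{bmatrix} \]
have the desired bandwidth properties and satisfy A = LU. □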


The specialization of Gaussian elimination to banded matrices having an 
LU factorization is straightforward. 


Algorithm 4.3.1 (Band Gaussian Elimination: Outer Product Version)
Given $A \in \mathbb{R}^{n\times n}$ with upper bandwidth q and lower bandwidth p,
the following algorithm computes the factorization A = LU, assuming it
exists. A(i,j) is overwritten by L(i,j) if i > j and by U(i,j) otherwise.

    for k = 1:n-1
        for i = k+1:min(k+p,n)
            A(i,k) = A(i,k)/A(k,k)
        end
        for j = k+1:min(k+q,n)
            for i = k+1:min(k+p,n)
                A(i,j) = A(i,j) - A(i,k)A(k,j)
            end
        end
    end


If $n \gg p$ and $n \gg q$ then this algorithm involves about 2npq flops. Band
versions of Algorithm 4.1.1 ($LDM^T$) and all the Cholesky procedures also
exist, but we leave their formulation to the exercises.
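
A minimal Python sketch of Algorithm 4.3.1 follows, assuming A is held as a conventional dense list of lists with 0-based indices. The function name `band_lu` is an assumption of this sketch; no pivoting is done, so every pivot A(k,k) must be nonzero.

```python
def band_lu(A, p, q):
    """Overwrite A with its LU factors, touching only entries inside the band.

    p is the lower bandwidth and q the upper bandwidth. The strict lower
    triangle receives the multipliers (L), the upper triangle receives U.
    """
    n = len(A)
    for k in range(n - 1):
        last_row = min(k + p, n - 1)   # last row with a nonzero in column k
        last_col = min(k + q, n - 1)   # last column with a nonzero in row k
        for i in range(k + 1, last_row + 1):
            A[i][k] /= A[k][k]
        for j in range(k + 1, last_col + 1):
            for i in range(k + 1, last_row + 1):
                A[i][j] -= A[i][k] * A[k][j]
    return A
```

For a tridiagonal matrix (p = q = 1) each step touches one multiplier and one diagonal entry, which is where the 2npq estimate comes from.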


4.3.2 Band Triangular System Solving 


Analogous savings can also be made when solving banded triangular sys- 
tems. 


Algorithm 4.3.2 (Band Forward Substitution: Column Version)
Let $L \in \mathbb{R}^{n\times n}$ be a unit lower triangular matrix having lower bandwidth
p. Given $b \in \mathbb{R}^n$, the following algorithm overwrites b with the solution to
Lx = b.

    for j = 1:n
        for i = j+1:min(j+p,n)
            b(i) = b(i) - L(i,j)b(j)
        end
    end

If $n \gg p$ then this algorithm requires about 2np flops.


Algorithm 4.3.3 (Band Back-Substitution: Column Version) Let
$U \in \mathbb{R}^{n\times n}$ be a nonsingular upper triangular matrix having upper
bandwidth q. Given $b \in \mathbb{R}^n$, the following algorithm overwrites b with the
solution to Ux = b.

    for j = n:-1:1
        b(j) = b(j)/U(j,j)
        for i = max(1,j-q):j-1
            b(i) = b(i) - U(i,j)b(j)
        end
    end

If $n \gg q$ then this algorithm requires about 2nq flops.
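
The two band triangular solves can be sketched in Python as follows, again assuming dense list-of-lists storage and 0-based indices. The names `band_forward_sub` and `band_back_sub` are assumptions of this sketch.

```python
def band_forward_sub(L, p, b):
    """Solve Lx = b for unit lower triangular L with lower bandwidth p.

    b is overwritten with the solution and returned (column-oriented sweep).
    """
    n = len(b)
    for j in range(n):
        for i in range(j + 1, min(j + p, n - 1) + 1):
            b[i] -= L[i][j] * b[j]
    return b


def band_back_sub(U, q, b):
    """Solve Ux = b for nonsingular upper triangular U with upper bandwidth q.

    b is overwritten with the solution and returned (column-oriented sweep).
    """
    n = len(b)
    for j in range(n - 1, -1, -1):
        b[j] /= U[j][j]
        for i in range(max(0, j - q), j):
            b[i] -= U[i][j] * b[j]
    return b
```

Applied one after the other to the factors of a banded A = LU, the two routines solve Ax = b.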
4.3.3 Band Gaussian Elimination with Pivoting


Gaussian elimination with partial pivoting can also be specialized to exploit
band structure in A. If, however, PA = LU, then the band properties of L
and U are not quite so simple. For example, if A is tridiagonal and the first
two rows are interchanged at the very first step of the algorithm, then $u_{13}$
is nonzero. Consequently, row interchanges expand bandwidth. Precisely
how the band enlarges is the subject of the following theorem.


Theorem 4.3.2 Suppose $A \in \mathbb{R}^{n\times n}$ is nonsingular and has upper and lower
bandwidths q and p, respectively. If Gaussian elimination with partial
pivoting is used to compute Gauss transformations
\[ M_j = I - \alpha^{(j)}e_j^T, \qquad j = 1:n-1 \]
and permutations $P_1, \ldots, P_{n-1}$ such that $M_{n-1}P_{n-1}\cdots M_1P_1A = U$ is
upper triangular, then U has upper bandwidth p + q and $\alpha_i^{(j)} = 0$ whenever
$i \le j$ or $i > j + p$.


Proof. Let PA = LU be the factorization computed by Gaussian elimination
with partial pivoting and recall that $P = P_{n-1}\cdots P_1$. Write
$P^T = [e_{s_1}, \ldots, e_{s_n}]$, where $\{s_1, \ldots, s_n\}$ is a permutation of
$\{1, 2, \ldots, n\}$. If $s_i > i + p$ then it follows that the leading i-by-i
principal submatrix of PA is singular, since $(PA)_{ij} = a_{s_i,j} = 0$ for
$j = 1:s_i - p - 1$ and $s_i - p - 1 \ge i$. This implies that U and A are
singular, a contradiction. Thus, $s_i \le i + p$ for i = 1:n and therefore, PA
has upper bandwidth p + q. It follows from Theorem 4.3.1 that U has upper
bandwidth p + q.

The assertion about the $\alpha^{(j)}$ can be verified by observing that $M_j$ need
only zero elements (j+1,j), ..., (j+p,j) of the partially reduced matrix
$P_jM_{j-1}P_{j-1}\cdots M_1P_1A$. □


Thus, pivoting destroys band structure in the sense that U becomes
"wider" than A's upper triangle, while nothing at all can be said about
the bandwidth of L. However, since the jth column of L is a permutation
of the jth Gauss vector $\alpha^{(j)}$, it follows that L has at most p + 1 nonzero
elements per column.


4.3.4 Hessenberg LU 


As an example of an unsymmetric band matrix computation, we show how
Gaussian elimination with partial pivoting can be applied to factor an upper
Hessenberg matrix H. (Recall that if H is upper Hessenberg then $h_{ij} = 0$,
$i > j + 1$.) After k - 1 steps of Gaussian elimination with partial pivoting
we are left with an upper Hessenberg matrix of the form:
\[ \begin{bmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & 0 & \times & \times \end{bmatrix} \qquad k = 3,\ n = 5 \]
By virtue of the special structure of this matrix, we see that the next
permutation, $P_3$, is either the identity or the identity with rows 3 and 4
interchanged. Moreover, the next Gauss transformation $M_k$ has a single
nonzero multiplier in the (k+1,k) position. This illustrates the kth step
of the following algorithm.


Algorithm 4.3.4 (Hessenberg LU) Given an upper Hessenberg matrix
$H \in \mathbb{R}^{n\times n}$, the following algorithm computes the upper triangular matrix
$M_{n-1}P_{n-1}\cdots M_1P_1H = U$ where each $P_k$ is a permutation and each $M_k$
is a Gauss transformation whose entries are bounded by unity. H(i,k) is
overwritten with U(i,k) if $i \le k$ and by $(M_k)_{k+1,k}$ if i = k + 1. An integer
vector piv(1:n-1) encodes the permutations. If $P_k = I$, then piv(k) = 0.
If $P_k$ interchanges rows k and k + 1, then piv(k) = 1.

    for k = 1:n-1
        if |H(k,k)| < |H(k+1,k)|
            piv(k) = 1; H(k,k:n) ↔ H(k+1,k:n)
        else
            piv(k) = 0
        end
        if H(k,k) ≠ 0
            t = -H(k+1,k)/H(k,k)
            for j = k+1:n
                H(k+1,j) = H(k+1,j) + tH(k,j)
            end
            H(k+1,k) = t
        end
    end


This algorithm requires $n^2$ flops.
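
A Python sketch of Algorithm 4.3.4 is given below, assuming H is a dense list of lists with 0-based indices. The function name `hessenberg_lu` is an assumption of this sketch.

```python
def hessenberg_lu(H):
    """Reduce an upper Hessenberg H to upper triangular form in place.

    Returns (H, piv). On exit the upper triangle of H holds U, H[k+1][k]
    holds the single multiplier of the kth Gauss transformation, and
    piv[k] is 1 if rows k and k+1 were interchanged, else 0.
    """
    n = len(H)
    piv = [0] * (n - 1)
    for k in range(n - 1):
        if abs(H[k][k]) < abs(H[k + 1][k]):
            piv[k] = 1
            # interchange rows k and k+1 in columns k..n-1 only
            for j in range(k, n):
                H[k][j], H[k + 1][j] = H[k + 1][j], H[k][j]
        if H[k][k] != 0:
            t = -H[k + 1][k] / H[k][k]
            for j in range(k + 1, n):
                H[k + 1][j] += t * H[k][j]
            H[k + 1][k] = t
    return H, piv
```

Because the larger of the two candidate pivots is always moved into position, every stored multiplier has modulus at most one.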


4.3.5 Band Cholesky 


The rest of this section is devoted to banded Ax = b problems where the
matrix A is also symmetric positive definite. The fact that pivoting is
unnecessary for such matrices leads to some very compact, elegant
algorithms. In particular, it follows from Theorem 4.3.1 that if $A = GG^T$ is the
Cholesky factorization of A, then G has the same lower bandwidth as A.
This leads to the following banded version of Algorithm 4.2.1, gaxpy-based
Cholesky.


Algorithm 4.3.5 (Band Cholesky: Gaxpy Version) Given a symmetric
positive definite $A \in \mathbb{R}^{n\times n}$ with bandwidth p, the following algorithm
computes a lower triangular matrix G with lower bandwidth p such that
$A = GG^T$. For all $i \ge j$, G(i,j) overwrites A(i,j).

    for j = 1:n
        for k = max(1,j-p):j-1
            λ = min(k+p,n)
            A(j:λ,j) = A(j:λ,j) - A(j,k)A(j:λ,k)
        end
        λ = min(j+p,n)
        A(j:λ,j) = A(j:λ,j)/sqrt(A(j,j))
    end


If $n \gg p$ then this algorithm requires about $n(p^2 + 3p)$ flops and n square
roots. Of course, in a serious implementation an appropriate data structure
for A should be used. For example, if we just store the nonzero lower
triangular part, then a (p + 1)-by-n array would suffice. (See §1.2.6.)

If our band Cholesky procedure is coupled with appropriate band
triangular solve routines then approximately $np^2 + 7np + 2n$ flops and n square
roots are required to solve Ax = b. For small p it follows that the square
roots represent a significant portion of the computation and it is
preferable to use the $LDL^T$ approach. Indeed, a careful flop count of the steps
$A = LDL^T$, Ly = b, Dz = y, and $L^Tx = z$ reveals that $np^2 + 8np + n$ flops
and no square roots are needed.
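
The gaxpy band Cholesky of Algorithm 4.3.5 can be sketched in Python as follows. The sketch keeps A in a dense list of lists for clarity rather than the compact (p + 1)-by-n array suggested above, and the name `band_cholesky` is an assumption.

```python
import math

def band_cholesky(A, p):
    """Overwrite the lower triangle of A with its Cholesky factor G.

    A must be symmetric positive definite with bandwidth p; only entries
    A[i][j] with j <= i <= j + p are referenced.
    """
    n = len(A)
    for j in range(n):
        for k in range(max(0, j - p), j):
            lam = min(k + p, n - 1)
            ajk = A[j][k]                    # saved before column j is changed
            for i in range(j, lam + 1):
                A[i][j] -= ajk * A[i][k]
        lam = min(j + p, n - 1)
        s = math.sqrt(A[j][j])               # fails if A is not positive definite
        for i in range(j, lam + 1):
            A[i][j] /= s
    return A
```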


4.3.6 Tridiagonal System Solving


As a sample narrow band $LDL^T$ solution procedure, we look at the case of
symmetric positive definite tridiagonal systems. Setting
\[ L = \begin{bmatrix} 1 & & \cdots & 0 \\ e_1 & 1 & & \vdots \\ & \ddots & \ddots & \\ 0 & \cdots & e_{n-1} & 1 \end{bmatrix} \]
and $D = \mathrm{diag}(d_1, \ldots, d_n)$ we deduce from the equation $A = LDL^T$ that:
\[ a_{11} = d_1 \]
\[ a_{k,k-1} = e_{k-1}d_{k-1}, \qquad k = 2:n \]
\[ a_{kk} = d_k + e_{k-1}^2d_{k-1} = d_k + e_{k-1}a_{k,k-1}, \qquad k = 2:n \]

Thus, the $d_i$ and $e_i$ can be resolved as follows:

    d_1 = a_11
    for k = 2:n
        e_{k-1} = a_{k,k-1}/d_{k-1}; d_k = a_kk - e_{k-1}a_{k,k-1}
    end

To obtain the solution to Ax = b we solve Ly = b, Dz = y, and $L^Tx = z$.
With overwriting we obtain 


Algorithm 4.3.6 (Symmetric, Tridiagonal, Positive Definite System
Solver) Given an n-by-n symmetric, tridiagonal, positive definite
matrix A and $b \in \mathbb{R}^n$, the following algorithm overwrites b with the
solution to Ax = b. It is assumed that the diagonal of A is stored in d(1:n) and
the superdiagonal in e(1:n-1).

    for k = 2:n
        t = e(k-1); e(k-1) = t/d(k-1); d(k) = d(k) - te(k-1)
    end
    for k = 2:n
        b(k) = b(k) - e(k-1)b(k-1)
    end
    b(n) = b(n)/d(n)
    for k = n-1:-1:1
        b(k) = b(k)/d(k) - e(k)b(k+1)
    end


This algorithm requires 8n flops. 
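
Algorithm 4.3.6 translates almost line for line into Python. The sketch below assumes 0-based lists `d` (diagonal), `e` (superdiagonal) and `b` (right-hand side), all of which are overwritten; the function name `spd_tridiag_solve` is an assumption.

```python
def spd_tridiag_solve(d, e, b):
    """Solve Ax = b for symmetric positive definite tridiagonal A.

    d holds the diagonal (length n), e the superdiagonal (length n-1).
    On return d holds D, e holds the subdiagonal of L, and b holds x.
    """
    n = len(d)
    # factor A = L D L^T
    for k in range(1, n):
        t = e[k - 1]
        e[k - 1] = t / d[k - 1]
        d[k] -= t * e[k - 1]
    # forward substitution: L y = b
    for k in range(1, n):
        b[k] -= e[k - 1] * b[k - 1]
    # diagonal scaling and back substitution: D z = y, L^T x = z
    b[n - 1] /= d[n - 1]
    for k in range(n - 2, -1, -1):
        b[k] = b[k] / d[k] - e[k] * b[k + 1]
    return b
```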


4.3.7 Vectorization Issues


The tridiagonal example brings up a sore point: narrow band problems and
vector/pipeline architectures do not mix well. The narrow band implies
short vectors. However, it is sometimes the case that large, independent
sets of such problems must be solved at the same time. Let us look at how
such a computation should be arranged in light of the issues raised in §1.4.

For simplicity, assume that we must solve the n-by-n unit lower
bidiagonal systems
\[ A^{(k)}x^{(k)} = b^{(k)}, \qquad k = 1:m \]
and that $m \gg n$. Suppose we have arrays E(1:n-1,1:m) and B(1:n,1:m)
with the property that E(1:n-1,k) houses the subdiagonal of $A^{(k)}$ and
B(1:n,k) houses the kth right hand side $b^{(k)}$. We can overwrite $b^{(k)}$ with
the solution $x^{(k)}$ as follows:

    for k = 1:m
        for i = 2:n
            B(i,k) = B(i,k) - E(i-1,k)B(i-1,k)
        end
    end


The problem with this algorithm, which sequentially solves each bidiagonal
system in turn, is that the inner loop does not vectorize. This is because
of the dependence of B(i,k) on B(i-1,k). If we interchange the k and i
loops we get

    for i = 2:n
        for k = 1:m
            B(i,k) = B(i,k) - E(i-1,k)B(i-1,k)        (4.3.1)
        end
    end


Now the inner loop vectorizes well as it involves a vector multiply and a
vector add. Unfortunately, (4.3.1) is not a unit stride procedure. However,
this problem is easily rectified if we store the subdiagonals and
right-hand-sides by row. That is, we use the arrays E(1:m,1:n-1) and B(1:m,1:n)
and store the subdiagonal of $A^{(k)}$ in E(k,1:n-1) and $b^{(k)}$ in B(k,1:n).
The computation (4.3.1) then transforms to

    for i = 2:n
        for k = 1:m
            B(k,i) = B(k,i) - E(k,i-1)B(k,i-1)
        end
    end

illustrating once again the effect of data structure on performance.
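
The effect of the two layouts on stride can be made concrete with a small Python model. Python itself does not vectorize, so the sketch only mimics the address pattern of a column-major (Fortran-style) array held in a flat list; the function name `solve_bidiagonal_batch` and the `by_row` flag are assumptions of this sketch, and it expects n and m to be at least 2.

```python
def solve_bidiagonal_batch(E, B, n, m, by_row):
    """Overwrite B with the solutions of m unit lower bidiagonal systems.

    E and B are flat lists holding column-major arrays:
      by_row=False: E is (n-1)-by-m and B is n-by-m (system k in column k)
      by_row=True:  E is m-by-(n-1) and B is m-by-n (system k in row k)
    Returns the distance in memory between the entries of B touched by
    consecutive iterations of the inner k loop.
    """
    def pos_b(i, k):            # 0-based equation i of system k
        return k + i * m if by_row else i + k * n

    def pos_e(i, k):
        return k + i * m if by_row else i + k * (n - 1)

    for i in range(1, n):       # loop order of (4.3.1): i outside, k inside
        for k in range(m):
            B[pos_b(i, k)] -= E[pos_e(i - 1, k)] * B[pos_b(i - 1, k)]
    return pos_b(1, 1) - pos_b(1, 0)
```

With the systems stored by column the returned stride is n, whereas storing them by row gives stride 1, which is the point of the rearrangement above.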


4.3.8 Band Matrix Data Structures 


The above algorithms are written as if the matrix A is conventionally stored
in an n-by-n array. In practice, a band linear equation solver would be
organized around a data structure that takes advantage of the many zeroes
in A. Recall from §1.2.6 that if A has lower bandwidth p and upper
bandwidth q it can be represented in a (p + q + 1)-by-n array A.band where
band entry $a_{ij}$ is stored in A.band(i - j + q + 1, j). In this arrangement, the
nonzero portion of A's jth column is housed in the jth column of A.band.
Another possible band matrix data structure that we discussed in §1.2.8
involves storing A by diagonal in a 1-dimensional array A.diag. Regardless
of the data structure adopted, the design of a matrix computation with a
band storage arrangement requires care in order to minimize subscripting
overheads.
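
The A.band mapping can be sketched in Python as follows. With 0-based indices the entry $a_{ij}$ lands in row i - j + q of the compact array; the helper names `to_band` and `band_get` are assumptions of this sketch.

```python
def to_band(A, p, q):
    """Pack a dense matrix with lower bandwidth p and upper bandwidth q
    into a (p + q + 1)-by-n array; column j of the result holds the
    in-band part of column j of A."""
    n = len(A)
    band = [[0.0] * n for _ in range(p + q + 1)]
    for j in range(n):
        for i in range(max(0, j - q), min(n - 1, j + p) + 1):
            band[i - j + q][j] = A[i][j]
    return band


def band_get(band, p, q, i, j):
    """Return a_ij from the packed array (zero outside the band)."""
    if -q <= i - j <= p:
        return band[i - j + q][j]
    return 0.0
```

Every access pays for the index arithmetic i - j + q, which is the subscripting overhead mentioned above.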


Problems 


P4.3.1 Derive a banded $LDM^T$ procedure similar to Algorithm 4.3.1.

P4.3.2 Show how the output of Algorithm 4.3.4 can be used to solve the upper
Hessenberg system Hx = b.

P4.3.3 Give an algorithm for solving an unsymmetric tridiagonal system Ax = b that
uses Gaussian elimination with partial pivoting. It should require only four n-vectors of
floating point storage for the factorization.

P4.3.4 For $C \in \mathbb{R}^{n\times n}$ define the profile indices $m(C,i) = \min\{j : c_{ij} \ne 0\}$, where
i = 1:n. Show that if $A = GG^T$ is the Cholesky factorization of A, then m(A,i) =
m(G,i) for i = 1:n. (We say that G has the same profile as A.)

P4.3.5 Suppose $A \in \mathbb{R}^{n\times n}$ is symmetric positive definite with profile indices $m_i =
m(A,i)$ where i = 1:n. Assume that A is stored in a one-dimensional array v as follows:
$v = (a_{11}, a_{2,m_2}, \ldots, a_{22}, a_{3,m_3}, \ldots, a_{33}, \ldots, a_{n,m_n}, \ldots, a_{nn})$. Write an algorithm that
overwrites v with the corresponding entries of the Cholesky factor G and then uses this
factorization to solve Ax = b. How many flops are required?

P4.3.6 For $C \in \mathbb{R}^{n\times n}$ define $p(C,i) = \max\{j : c_{ij} \ne 0\}$. Suppose that $A \in \mathbb{R}^{n\times n}$ has an
LU factorization A = LU and that:
\[ m(A,1) \le m(A,2) \le \cdots \le m(A,n) \]
\[ p(A,1) \le p(A,2) \le \cdots \le p(A,n) \]
Show that m(A,i) = m(L,i) and p(A,i) = p(U,i) for i = 1:n. Recall the definition of
m(A,i) from P4.3.4.

P4.3.7 Develop a gaxpy version of Algorithm 4.3.1.

P4.3.8 Develop a unit stride, vectorizable algorithm for solving the symmetric positive
definite tridiagonal systems $A^{(k)}x^{(k)} = b^{(k)}$. Assume that the diagonals, superdiagonals,
and right hand sides are stored by row in arrays D, E, and B and that $b^{(k)}$ is overwritten
with $x^{(k)}$.

P4.3.9 Develop a version of Algorithm 4.3.1 in which A is stored by diagonal.

P4.3.10 Give an example of a 3-by-3 symmetric positive definite matrix whose
tridiagonal part is not positive definite.

P4.3.11 Consider the Ax = b problem where
\[ A = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 & -1 \\ -1 & 2 & -1 & & & 0 \\ 0 & -1 & 2 & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \ddots & 0 \\ 0 & & & -1 & 2 & -1 \\ -1 & 0 & \cdots & 0 & -1 & 2 \end{bmatrix}. \]
This kind of matrix arises in boundary value problems with periodic boundary conditions.
(a) Show A is singular. (b) Give conditions that b must satisfy for there to exist a solution
and specify an algorithm for solving it. (c) Assume that n is even and consider the
permutation
\[ P = [\, e_1 \;\; e_n \;\; e_2 \;\; e_{n-1} \;\; e_3 \;\; \cdots \,] \]
where $e_k$ is the kth column of $I_n$. Describe the transformed system $P^TAP(P^Tx) = P^Tb$
and show how to solve it. Assume that there is a solution and ignore pivoting.


Notes and References for Sec. 4.3

N.J. Higham (1990). “Bounding the Error in Gaussian Elimination for Tridiagonal
Systems,” SIAM J. Matrix Anal. Appl. 11, 521-530.

D.J. Rose (1968). “An Algorithm for Solving a Special Class of Tridiagonal Systems of
Linear Equations,” Comm. ACM 12, 234-36.

H.S. Stone (1973). “An Efficient Parallel Algorithm for the Solution of a Tridiagonal
Linear System of Equations,” J. ACM 20, 27-38.

M.A. Malcolm and J. Palmer (1974). “A Fast Method for Solving a Class of Tridiagonal
Systems of Linear Equations,” Comm. ACM 17, 14-17.

J. Lambiotte and R.G. Voigt (1975). “The Solution of Tridiagonal Linear Systems on
the CDC-STAR 100 Computer,” ACM Trans. Math. Soft. 1, 308-29.

H.S. Stone (1975). “Parallel Tridiagonal Equation Solvers,” ACM Trans. Math. Soft. 1,
289-307.

D. Kershaw (1982). “Solution of Single Tridiagonal Linear Systems and Vectorization of
the ICCG Algorithm on the Cray-1,” in G. Roderigue (ed), Parallel Computation,

N.J. Higham (1986). “Efficient Algorithms for Computing the Condition Number of a
Tridiagonal Matrix,” SIAM J. Sci. and Stat. Comp. 7, 150-165.


Chapter 4 of George and Liu (1981) contains a nice survey of band methods for positive
definite systems.


4.4 Symmetric Indefinite Systems


A symmetric matrix whose quadratic form $x^TAx$ takes on both positive and
negative values is called indefinite. Although an indefinite A may have an
$LDL^T$ factorization, the entries in the factors can have arbitrary magnitude:
\[ \begin{bmatrix} \epsilon & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1/\epsilon & 1 \end{bmatrix} \begin{bmatrix} \epsilon & 0 \\ 0 & -1/\epsilon \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1/\epsilon & 1 \end{bmatrix}^T . \]
Of course, any of the pivot strategies in §3.4 could be invoked. However,
they destroy symmetry and with it, the chance for a "Cholesky speed"
indefinite system solver. Symmetric pivoting, i.e., data reshufflings of the
form $A \leftarrow PAP^T$, must be used as we discussed in §4.2.9. Unfortunately,
symmetric pivoting does not always stabilize the $LDL^T$ computation. If $\epsilon_1$
and $\epsilon_2$ are small then regardless of P, the matrix
\[ \tilde{A} = P \begin{bmatrix} \epsilon_1 & 1 \\ 1 & \epsilon_2 \end{bmatrix} P^T \]
has small diagonal entries and large numbers surface in the factorization.
With symmetric pivoting, the pivots are always selected from the diagonal
and trouble results if these numbers are small relative to what must be
zeroed off the diagonal. Thus, $LDL^T$ with symmetric pivoting cannot be
recommended as a reliable approach to symmetric indefinite system solving.
It seems that the challenge is to involve the off-diagonal entries in the
pivoting process while at the same time maintaining symmetry.

In this section we discuss two ways to do this. The first method is due
to Aasen (1971) and it computes the factorization
\[ PAP^T = LTL^T \tag{4.4.1} \]
where $L = (\ell_{ij})$ is unit lower triangular and T is tridiagonal. P is a
permutation chosen such that $|\ell_{ij}| \le 1$. In contrast, the diagonal pivoting
method due to Bunch and Parlett (1971) computes a permutation P such that
\[ PAP^T = LDL^T \tag{4.4.2} \]
where D is a direct sum of 1-by-1 and 2-by-2 pivot blocks. Again, P is
chosen so that the entries in the unit lower triangular L satisfy $|\ell_{ij}| \le 1$.
Both factorizations involve $n^3/3$ flops and once computed, can be used to
solve Ax = b with $O(n^2)$ work:
\[ PAP^T = LTL^T,\ Lz = Pb,\ Tw = z,\ L^Ty = w,\ x = P^Ty \;\Rightarrow\; Ax = b \]
\[ PAP^T = LDL^T,\ Lz = Pb,\ Dw = z,\ L^Ty = w,\ x = P^Ty \;\Rightarrow\; Ax = b \]


The only thing "new" to discuss in these solution procedures are the Tw = z
and Dw = z systems.

In Aasen's method, the symmetric indefinite tridiagonal system Tw = z
is solved in O(n) time using band Gaussian elimination with pivoting. Note
that there is no serious price to pay for the disregard of symmetry at this
level since the overall process is $O(n^3)$.

In the diagonal pivoting approach, the Dw = z system amounts to a set
of 1-by-1 and 2-by-2 symmetric indefinite systems. The 2-by-2 problems
can be handled via Gaussian elimination with pivoting. Again, there is no
harm in disregarding symmetry during this O(n) phase of the calculation.

Thus, the central issue in this section is the efficient computation of the
factorizations (4.4.1) and (4.4.2).


4.4.1 The Parlett-Reid Algorithm 


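Parlett and Reid (1970) show how to compute (4.4.1) using Gauss
transforms. Their algorithm is sufficiently illustrated by displaying the k = 2
step for the case n = 5. At the beginning of this step the matrix A has
been transformed to
\[ A^{(1)} = M_1P_1AP_1^TM_1^T = \begin{bmatrix} \alpha_1 & \beta_1 & 0 & 0 & 0 \\ \beta_1 & \alpha_2 & v_3 & v_4 & v_5 \\ 0 & v_3 & \times & \times & \times \\ 0 & v_4 & \times & \times & \times \\ 0 & v_5 & \times & \times & \times \end{bmatrix} \]
where $P_1$ is a permutation chosen so that the entries in the Gauss
transformation $M_1$ are bounded by unity in modulus. Scanning the vector
$(v_3\ v_4\ v_5)^T$ for its largest entry, we now determine a 3-by-3 permutation
$\tilde{P}_2$ such that
\[ \tilde{P}_2 \begin{bmatrix} v_3 \\ v_4 \\ v_5 \end{bmatrix} = \begin{bmatrix} \tilde{v}_3 \\ \tilde{v}_4 \\ \tilde{v}_5 \end{bmatrix} \;\Rightarrow\; |\tilde{v}_3| = \max\{|\tilde{v}_3|, |\tilde{v}_4|, |\tilde{v}_5|\}. \]
If this maximal element is zero, we set $M_2 = P_2 = I$ and proceed to the
next step. Otherwise, we set $P_2 = \mathrm{diag}(I_2, \tilde{P}_2)$ and $M_2 = I - \alpha^{(2)}e_3^T$ with
\[ \alpha^{(2)} = (\,0 \;\; 0 \;\; 0 \;\; \tilde{v}_4/\tilde{v}_3 \;\; \tilde{v}_5/\tilde{v}_3\,)^T \]
and observe that
\[ A^{(2)} = M_2P_2A^{(1)}P_2^TM_2^T = \begin{bmatrix} \alpha_1 & \beta_1 & 0 & 0 & 0 \\ \beta_1 & \alpha_2 & \tilde{v}_3 & 0 & 0 \\ 0 & \tilde{v}_3 & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & \times & \times & \times \end{bmatrix}. \]
In general, the process continues for n - 2 steps leaving us with a tridiagonal
matrix
\[ T = A^{(n-2)} = (M_{n-2}P_{n-2}\cdots M_1P_1)A(M_{n-2}P_{n-2}\cdots M_1P_1)^T. \]
It can be shown that (4.4.1) holds with $P = P_{n-2}\cdots P_1$ and
\[ L = (M_{n-2}P_{n-2}\cdots M_1P_1P^T)^{-1}. \]
Analysis of L reveals that its first column is $e_1$ and that its subdiagonal
entries in column k with k > 1 are "made up" of the multipliers in $M_{k-1}$.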

The efficient implementation of the Parlett-Reid method requires care
when computing the update
\[ A^{(k)} = M_k(P_kA^{(k-1)}P_k^T)M_k^T. \tag{4.4.3} \]
To see what is involved with a minimum of notation, suppose $B = B^T$ has
order n - k and that we wish to form $B_+ = (I - we_1^T)B(I - we_1^T)^T$ where
$w \in \mathbb{R}^{n-k}$ and $e_1$ is the first column of $I_{n-k}$. Such a calculation is at the
heart of (4.4.3). If we set
\[ u = Be_1 - \frac{b_{11}}{2}w, \]
then the lower half of the symmetric matrix $B_+ = B - wu^T - uw^T$ can
be formed in $2(n-k)^2$ flops. Summing this quantity as k ranges from 1
to n - 2 indicates that the Parlett-Reid procedure requires $2n^3/3$ flops,
twice what we would like.
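
The rank-2 form of the update can be sketched in Python as follows. The function name `symmetric_gauss_update` is an assumption; the sketch computes only the lower triangle of $B_+$ from $B - wu^T - uw^T$ and mirrors it, so each computed entry costs four flops.

```python
def symmetric_gauss_update(B, w):
    """Return B_plus = (I - w e1^T) B (I - w e1^T)^T for symmetric B.

    Uses u = B e1 - (b11 / 2) w and B_plus = B - w u^T - u w^T.
    """
    m = len(B)
    b11 = B[0][0]
    u = [B[i][0] - 0.5 * b11 * w[i] for i in range(m)]
    Bp = [row[:] for row in B]
    for i in range(m):
        for j in range(i + 1):               # lower triangle only
            Bp[i][j] = B[i][j] - w[i] * u[j] - u[i] * w[j]
            Bp[j][i] = Bp[i][j]              # mirror to keep the result symmetric
    return Bp
```

Forming the triple product explicitly would roughly double the work, which is why the intermediate vector u is introduced.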


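Example 4.4.1 If the Parlett-Reid algorithm is applied to
\[ A = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 2 & 2 \\ 2 & 2 & 3 & 3 \\ 3 & 2 & 3 & 4 \end{bmatrix} \]
then
\[ P_1 = [\, e_1 \;\; e_4 \;\; e_3 \;\; e_2 \,] \]
\[ M_1 = I_4 - (0, 0, 2/3, 1/3)^Te_2^T \]
\[ P_2 = [\, e_1 \;\; e_2 \;\; e_4 \;\; e_3 \,] \]
\[ M_2 = I_4 - (0, 0, 0, 1/2)^Te_3^T \]
and $PAP^T = LTL^T$, where $P = [\, e_1, e_3, e_4, e_2 \,]$,
\[ L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1/3 & 1 & 0 \\ 0 & 2/3 & 1/2 & 1 \end{bmatrix} \quad\text{and}\quad T = \begin{bmatrix} 0 & 3 & 0 & 0 \\ 3 & 4 & 2/3 & 0 \\ 0 & 2/3 & 10/9 & 0 \\ 0 & 0 & 0 & 1/2 \end{bmatrix}. \]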

4.4.2 The Method of Aasen 


An $n^3/3$ approach to computing (4.4.1) due to Aasen (1971) can be derived
by reconsidering some of the computations in the Parlett-Reid approach.
We need a notation for the tridiagonal T:
\[ T = \begin{bmatrix} \alpha_1 & \beta_1 & & \cdots & 0 \\ \beta_1 & \alpha_2 & \ddots & & \vdots \\ & \ddots & \ddots & \ddots & \\ \vdots & & \ddots & \ddots & \beta_{n-1} \\ 0 & \cdots & & \beta_{n-1} & \alpha_n \end{bmatrix} \]
For clarity, we temporarily ignore pivoting and assume that the
factorization $A = LTL^T$ exists where L is unit lower triangular with $L(:,1) = e_1$.
Aasen's method is organized as follows:

    for j = 1:n
        Compute h(1:j) where h = T L^T e_j = H e_j.
        Compute α(j).
        if j ≤ n-1
            Compute β(j).                               (4.4.4)
        end
        if j ≤ n-2
            Compute L(j+2:n, j+1).
        end
    end


Thus, the mission of the jth Aasen step is to compute the jth column of
T and the (j + 1)-st column of L. The algorithm exploits the fact that the
matrix $H = TL^T$ is upper Hessenberg. As can be deduced from (4.4.4),
the computation of α(j), β(j), and L(j+2:n, j+1) hinges upon the vector
h(1:j) = H(1:j, j). Let us see why.

Consider the jth column of the equation A = LH:
\[ A(:,j) = L(:,1:j+1)\,h(1:j+1). \tag{4.4.5} \]
This says that A(:,j) is a linear combination of the first j + 1 columns of
L. In particular,
\[ A(j+1:n,j) = L(j+1:n,1:j)\,h(1:j) + L(j+1:n,j+1)\,h(j+1). \]
It follows that if we compute
\[ v(j+1:n) = A(j+1:n,j) - L(j+1:n,1:j)\,h(1:j), \]
then
\[ L(j+1:n,j+1)\,h(j+1) = v(j+1:n). \tag{4.4.6} \]
Thus, L(j+2:n, j+1) is a scaling of v(j+2:n). Since L is unit lower
triangular we have from (4.4.6) that
\[ v(j+1) = h(j+1) \]
and so from that same equation we obtain the following recipe for the
(j + 1)-st column of L:
\[ L(j+2:n,j+1) = v(j+2:n)/v(j+1). \]
Note that L(j+2:n, j+1) is a scaled gaxpy.
We next develop formulae for α(j) and β(j). Compare the (j, j) and
(j+1, j) entries in the equation $H = TL^T$. With the convention β(0) = 0
we find that h(j) = β(j-1)L(j, j-1) + α(j) and h(j+1) = β(j). Since
h(j+1) = v(j+1), it follows that


α(j) = h(j) - β(j-1)L(j, j-1)
β(j) = v(j+1).

With these recipes we can completely describe the Aasen procedure:


    for j = 1:n
        Compute h(1:j) where h = TL^T e_j.
        if j = 1 ∨ j = 2
            α(j) = h(j)
        else
            α(j) = h(j) - β(j-1)L(j, j-1)
        end
        if j ≤ n-1                                          (4.4.7)
            v(j+1:n) = A(j+1:n, j) - L(j+1:n, 1:j)h(1:j)
            β(j) = v(j+1)
        end
        if j ≤ n-2
            L(j+2:n, j+1) = v(j+2:n)/v(j+1)
        end
    end


To complete the description we must detail the computation of h(1:j).
From (4.4.5) it follows that

A(1:j, j) = L(1:j, 1:j)h(1:j).   (4.4.8)


This lower triangular system can be solved for h(1:j) since we know the first
j columns of L. However, a much more efficient way to compute H(1:j, j)
is obtained by exploiting the jth column of the equation $H = TL^T$. In
particular, with the convention that β(0)L(j, 0) = 0 we have


h(k) = β(k-1)L(j, k-1) + α(k)L(j, k) + β(k)L(j, k+1)


for k = 1:j. These are working formulae except in the case k = j because
we have not yet computed α(j) and β(j). However, once h(1:j-1) is known
we can obtain h(j) from the last row of the triangular system (4.4.8), i.e.,


$h(j) = A(j,j) - \sum_{k=1}^{j-1} L(j,k)\,h(k)$.


Collecting results and using a work array ℓ(1:n) for L(j, 1:j) we see that
the computation of h(1:j) in (4.4.7) can be organized as follows:


    if j = 1
        h(1) = A(1,1)
    elseif j = 2
        h(1) = β(1); h(2) = A(2,2)                          (4.4.9)
    else
        ℓ(0) = 0; ℓ(1) = 0; ℓ(2:j-1) = L(j, 2:j-1); ℓ(j) = 1
        h(j) = A(j, j)
        for k = 1:j-1
            h(k) = β(k-1)ℓ(k-1) + α(k)ℓ(k) + β(k)ℓ(k+1)
            h(j) = h(j) - ℓ(k)h(k)
        end
    end


Note that with this O(j) method for computing h(1:j), the gaxpy calculation
of v(j+1:n) is the dominant operation in (4.4.7). During the jth step
this gaxpy involves about 2j(n-j) flops. Summing this for j = 1:n shows
that Aasen's method requires $n^3/3$ flops. Thus, the Aasen and Cholesky
algorithms entail the same amount of arithmetic.
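The unpivoted procedure (4.4.7), with h(1:j) computed as in (4.4.9), can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `aasen_no_pivot` and `reconstruct` are hypothetical, indices are 0-based, the three cases of (4.4.9) are folded into one loop by using the fact that L(:,1) = e_1, and exact rational arithmetic is used so that the factorization can be checked with `==`. Because no pivoting is done, a zero v(j+1) makes the division fail.

```python
from fractions import Fraction

def aasen_no_pivot(A):
    """Return (alpha, beta, L) with A = L T L^T, T tridiagonal, L(:,1) = e_1.

    alpha[j] and beta[j] hold the text's alpha(j+1) and beta(j+1).
    """
    n = len(A)
    alpha = [Fraction(0)] * n
    beta = [Fraction(0)] * (n - 1)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for j in range(n):
        # h(1:j) as in (4.4.9); row j of L is known through column j.
        h = [Fraction(0)] * (j + 1)
        h[j] = Fraction(A[j][j])
        for k in range(j):
            left = beta[k - 1] * L[j][k - 1] if k > 0 else 0
            h[k] = left + alpha[k] * L[j][k] + beta[k] * L[j][k + 1]
            h[j] -= L[j][k] * h[k]
        alpha[j] = h[j] - (beta[j - 1] * L[j][j - 1] if j > 0 else 0)
        if j < n - 1:
            # The gaxpy v(j+1:n) = A(j+1:n, j) - L(j+1:n, 1:j) h(1:j).
            v = {i: A[i][j] - sum(L[i][k] * h[k] for k in range(j + 1))
                 for i in range(j + 1, n)}
            beta[j] = v[j + 1]
            for i in range(j + 2, n):
                L[i][j + 1] = v[i] / v[j + 1]
    return alpha, beta, L

def reconstruct(alpha, beta, L):
    """Multiply out L T L^T so the factorization can be compared with A."""
    n = len(alpha)
    T = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = alpha[i]
        if i < n - 1:
            T[i][i + 1] = T[i + 1][i] = beta[i]
    LT = [[sum(L[i][k] * T[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(LT[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Illustrative symmetric matrix (an assumption, not taken from the text).
A = [[2, 1, 1, 1], [1, 3, 1, 2], [1, 1, 4, 1], [1, 2, 1, 5]]
alpha, beta, L = aasen_no_pivot(A)
assert reconstruct(alpha, beta, L) == A
```

The dictionary `v` is the only O(j(n-j)) computation in step j, which mirrors the remark that the gaxpy dominates the cost.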


4.4.3 Pivoting in Aasen's Method 


As it now stands, the columns of L are scalings of the v-vectors in (4.4.7). 
If any of these scalings are large, i.e., if any of the v(j + 1)'s are small, 
then we are in trouble. To circumvent this problem we need only permute 
the largest component of v(j+1:n) to the top position. Of course, this
permutation must be suitably applied to the unreduced portion of A and 
the previously computed portion of L. 


Algorithm 4.4.1 (Aasen's Method) If $A \in \mathbb{R}^{n \times n}$ is symmetric then
the following algorithm computes a permutation P, a unit lower triangular
L, and a tridiagonal T such that $PAP^T = LTL^T$ with |L(i, j)| ≤ 1. The
permutation P is encoded in an integer vector piv. In particular, $P =
P_1 \cdots P_{n-2}$ where $P_j$ is the identity with rows piv(j) and j+1 interchanged.
The diagonal and subdiagonal of T are stored in α(1:n) and β(1:n-1),
respectively. Only the subdiagonal portion of L(2:n, 2:n) is computed.


    for j = 1:n
        Compute h(1:j) via (4.4.9).
        if j = 1 ∨ j = 2
            α(j) = h(j)
        else
            α(j) = h(j) - β(j-1)L(j, j-1)
        end
        if j ≤ n-1
            v(j+1:n) = A(j+1:n, j) - L(j+1:n, 1:j)h(1:j)
            Find q so |v(q)| = ‖v(j+1:n)‖_∞ with j+1 ≤ q ≤ n.
            piv(j) = q; v(j+1) ↔ v(q); L(j+1, 2:j) ↔ L(q, 2:j)
            A(j+1, j+1:n) ↔ A(q, j+1:n)
            A(j+1:n, j+1) ↔ A(j+1:n, q)
            β(j) = v(j+1)
        end
        if j ≤ n-2
            L(j+2:n, j+1) = v(j+2:n)
            if v(j+1) ≠ 0
                L(j+2:n, j+1) = L(j+2:n, j+1)/v(j+1)
            end
        end
    end


Aasen's method is stable in the same sense that Gaussian elimination with
partial pivoting is stable. That is, the exact factorization of a matrix near
A is obtained provided $\| \hat{T} \|_2 / \| A \|_2 \approx 1$, where $\hat{T}$ is the computed version
of the tridiagonal matrix T. In general, this is almost always the case.

In a practical implementation of the Aasen algorithm, the lower triangular
portion of A would be overwritten with L and T. Here is the n = 5
case:


$$A \leftarrow \begin{bmatrix} \alpha_1 & & & & \\ \beta_1 & \alpha_2 & & & \\ \ell_{32} & \beta_2 & \alpha_3 & & \\ \ell_{42} & \ell_{43} & \beta_3 & \alpha_4 & \\ \ell_{52} & \ell_{53} & \ell_{54} & \beta_4 & \alpha_5 \end{bmatrix}$$


Notice that the columns of L are shifted left in this arrangement.


4.4.4 Diagonal Pivoting Methods


We next describe the computation of the block $LDL^T$ factorization (4.4.2).
We follow the discussion in Bunch and Parlett (1971). Suppose


$$P_1 A P_1^T = \begin{bmatrix} E & C^T \\ C & B \end{bmatrix}$$


where E is s-by-s, B is (n-s)-by-(n-s), $P_1$ is a permutation matrix, and
s = 1 or 2. If A is nonzero, then it is
always possible to choose these quantities so that E is nonsingular thereby
enabling us to write


$$P_1 A P_1^T = \begin{bmatrix} I_s & 0 \\ CE^{-1} & I_{n-s} \end{bmatrix} \begin{bmatrix} E & 0 \\ 0 & B - CE^{-1}C^T \end{bmatrix} \begin{bmatrix} I_s & E^{-1}C^T \\ 0 & I_{n-s} \end{bmatrix}.$$


For the sake of stability, the s-by-s "pivot" E should be chosen so that the
entries in

$\tilde{A} = (\tilde{a}_{ij}) \equiv B - CE^{-1}C^T$   (4.4.10)


are suitably bounded. To this end, let $\alpha \in (0,1)$ be given and define the
size measures


$\mu_0 = \max_{i,j} |a_{ij}|$,
$\mu_1 = \max_{i} |a_{ii}|$.


The Bunch-Parlett pivot strategy is as follows:


    if μ_1 ≥ αμ_0
        s = 1
        Choose P_1 so |e_11| = μ_1.
    else
        s = 2
        Choose P_1 so |e_21| = μ_0.
    end


It is easy to verify from (4.4.10) that if s = 1 then


$|\tilde{a}_{ij}| \le (1 + \alpha^{-1})\,\mu_0$   (4.4.11)

while s = 2 implies

$|\tilde{a}_{ij}| \le \dfrac{3-\alpha}{1-\alpha}\,\mu_0$.   (4.4.12)

By equating $(1 + \alpha^{-1})^2$, the growth factor associated with two s = 1 steps,
and $(3-\alpha)/(1-\alpha)$, the corresponding s = 2 factor, Bunch and Parlett conclude
that $\alpha = (1 + \sqrt{17})/8$ is optimum from the standpoint of minimizing
the bound on element growth.
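For the reader who wants the intermediate algebra, the balancing condition can be worked out as follows. This derivation is an added illustration; it uses only the two growth bounds quoted above.

```latex
\begin{aligned}
\left(1 + \alpha^{-1}\right)^2 = \frac{3-\alpha}{1-\alpha}
  &\iff (1+\alpha)^2 (1-\alpha) = \alpha^2 (3-\alpha) \\
  &\iff 1 + \alpha - \alpha^2 - \alpha^3 = 3\alpha^2 - \alpha^3 \\
  &\iff 4\alpha^2 - \alpha - 1 = 0 \\
  &\iff \alpha = \frac{1 \pm \sqrt{17}}{8}.
\end{aligned}
```

Only the positive root lies in (0,1), giving $\alpha = (1+\sqrt{17})/8 \approx 0.6404$, for which both sides of the first line are approximately 6.56.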

The reductions outlined above are then repeated on the n-s order
symmetric matrix $\tilde{A}$. A simple induction argument establishes that the
factorization (4.4.2) exists and that $n^3/3$ flops are required if the work
associated with pivot determination is ignored.


4.4.5 Stability and Efficiency 


Diagonal pivoting with the above strategy is shown by Bunch (1971) to be
as stable as Gaussian elimination with complete pivoting. Unfortunately,
the overall process requires between $n^3/12$ and $n^3/6$ comparisons, since $\mu_0$
involves a two-dimensional search at each stage of the reduction. The actual
number of comparisons depends on the total number of 2-by-2 pivots but
in general the Bunch-Parlett method for computing (4.4.2) is considerably
slower than the technique of Aasen. See Barwell and George (1976).

This is not the case with the diagonal pivoting method of Bunch and 
Kaufman (1977). In their scheme, it is only necessary to scan two columns 
at each stage of the reduction. The strategy is fully illustrated by consid- 
ering the very first step in the reduction: 


    α = (1 + √17)/8; λ = |a_r1| = max{|a_21|, ..., |a_n1|}
    if λ > 0
        if |a_11| ≥ αλ
            s = 1; P_1 = I
        else
            σ = |a_pr| = max{|a_1r|, ..., |a_{r-1,r}|, |a_{r+1,r}|, ..., |a_nr|}
            if σ|a_11| ≥ αλ^2
                s = 1; P_1 = I
            elseif |a_rr| ≥ ασ
                s = 1 and choose P_1 so (P_1^T A P_1)_11 = a_rr.
            else
                s = 2 and choose P_1 so (P_1^T A P_1)_21 = a_rp.
            end
        end
    end


Overall, the Bunch-Kaufman algorithm requires $n^3/3$ flops, $O(n^2)$ comparisons,
and, like all the methods of this section, $n^2/2$ storage.
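The first-step pivot test can be expressed as a small function. The sketch below is illustrative only: the name `bunch_kaufman_first_pivot` and its return convention (pivot order plus the 0-based rows found by the two column scans) are assumptions made here, and the case λ = 0, which the strategy above leaves implicit, is treated as a 1-by-1 step with no interchange.

```python
import math

ALPHA = (1 + math.sqrt(17)) / 8

def bunch_kaufman_first_pivot(A):
    """Return (s, r, p) for the first reduction step of a symmetric A (n >= 2).

    s is the pivot order (1 or 2). r and p are 0-based row indices located by
    the scans of column 1 and column r; they are None when no interchange
    involving them is required.
    """
    n = len(A)
    r = max(range(1, n), key=lambda i: abs(A[i][0]))   # scan column 1
    lam = abs(A[r][0])
    if lam == 0 or abs(A[0][0]) >= ALPHA * lam:
        return 1, None, None          # use a_11 as the pivot, P_1 = I
    p = max((i for i in range(n) if i != r), key=lambda i: abs(A[i][r]))
    sigma = abs(A[p][r])              # largest off-diagonal entry in column r
    if sigma * abs(A[0][0]) >= ALPHA * lam ** 2:
        return 1, None, None          # a_11 is still acceptable
    if abs(A[r][r]) >= ALPHA * sigma:
        return 1, r, None             # bring a_rr to the (1,1) position
    return 2, r, p                    # 2-by-2 pivot with a_rp in position (2,1)

# 3-by-3 symmetric test matrix: lambda = 20, sigma = 30, and a 2-by-2 pivot.
A = [[1, 10, 20], [10, 1, 30], [20, 30, 1]]
print(bunch_kaufman_first_pivot(A))
```

Only two columns are scanned, which is the source of the $O(n^2)$ comparison count over the whole reduction.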
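

Example 4.4.3 If the Bunch-Kaufman algorithm is applied to

$$A = \begin{bmatrix} 1 & 10 & 20 \\ 10 & 1 & 30 \\ 20 & 30 & 1 \end{bmatrix}$$

then in the first step λ = 20, r = 3, σ = 30, and p = 2. The permutation $P = [\, e_3 \; e_2 \; e_1 \,]$
is applied giving

$$PAP^T = \begin{bmatrix} 1 & 30 & 20 \\ 30 & 1 & 10 \\ 20 & 10 & 1 \end{bmatrix}.$$


A 2-by-2 pivot is then used to produce the reduction


$$PAP^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ .3115 & .6563 & 1 \end{bmatrix} \begin{bmatrix} 1 & 30 & 0 \\ 30 & 1 & 0 \\ 0 & 0 & -11.7920 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ .3115 & .6563 & 1 \end{bmatrix}^T.$$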


4.4.6 A Note on Equilibrium Systems

A very important class of symmetric indefinite matrices have the form

$$A = \begin{bmatrix} C & B \\ B^T & 0 \end{bmatrix}$$   (4.4.13)


where $C \in \mathbb{R}^{n \times n}$ is symmetric positive definite and $B \in \mathbb{R}^{n \times p}$ has full column rank. These
conditions ensure that A is nonsingular.

Of course, the methods of this section apply to A. However, they do not 
exploit its structure because the pivot strategies “wipe out” the zero (2,2) 
block. On the other hand, here is a tempting approach that does exploit 
A's block structure: 


(a) Compute the Cholesky factorization of C, $C = GG^T$.

(b) Solve GK = B for $K \in \mathbb{R}^{n \times p}$.

(c) Compute the Cholesky factorization of $K^TK = B^TC^{-1}B$, $HH^T = K^TK$.


From this it follows that

$$A = \begin{bmatrix} G & 0 \\ K^T & H \end{bmatrix} \begin{bmatrix} G^T & K \\ 0 & -H^T \end{bmatrix}.$$

In principle, this triangular factorization can be used to solve the equilibrium
system

$$\begin{bmatrix} C & B \\ B^T & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} f \\ g \end{bmatrix}.$$   (4.4.14)

However, it is clear by considering steps (b) and (c) above that the accuracy
of the computed solution depends upon κ(C) and this quantity may be
much greater than κ(A). The situation has been carefully analyzed and
various structure-exploiting algorithms have been proposed. A brief review
of the literature is given at the end of the section.
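As a check on steps (a) through (c), the triangular factors can be multiplied out block by block. The worked product below is an added illustration that uses only the relations $C = GG^T$, $GK = B$, and $HH^T = K^TK$.

```latex
\begin{bmatrix} G & 0 \\ K^T & H \end{bmatrix}
\begin{bmatrix} G^T & K \\ 0 & -H^T \end{bmatrix}
=
\begin{bmatrix} GG^T & GK \\ K^TG^T & K^TK - HH^T \end{bmatrix}
=
\begin{bmatrix} C & B \\ B^T & 0 \end{bmatrix}.
```

The zero (2,2) block arises from an exact cancellation between $K^TK$ and $HH^T$, both of which involve $C^{-1}$ through $K = G^{-1}B$; this is one way to see why κ(C) enters the error behaviour.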

But before we close it is interesting to consider a special case of (4.4.14) 
that clarifies what it means for an algorithm to be stable and illustrates 
how perturbation analysis can structure the search for better methods. 
In several important applications, g = 0, C is diagonal, and the solution
subvector y is of primary importance. A manipulation of (4.4.14) shows
that this vector is specified by


$y = (B^TC^{-1}B)^{-1}B^TC^{-1}f$.   (4.4.15)


Looking at this we are again led to believe that κ(C) should have a bearing
on the accuracy of the computed y. However, it can be shown that


$\| (B^TC^{-1}B)^{-1}B^TC^{-1} \| \le \psi_B$   (4.4.16)


where the upper bound $\psi_B$ is independent of C, a result that (correctly)
suggests that y is not sensitive to perturbations in C. A stable method for
computing this vector should respect this, meaning that the accuracy of
the computed y should be independent of C. Vavasis (1994) has developed
a method with this property. It involves the careful assembly of a matrix
$V \in \mathbb{R}^{n \times (n-p)}$ whose columns are a basis for the nullspace of $B^TC^{-1}$. The
n-by-n linear system


$$[\, B \;\; V \,] \begin{bmatrix} y \\ q \end{bmatrix} = f$$


is then solved implying f = By + Vq. Thus, $B^TC^{-1}f = B^TC^{-1}By$ and
(4.4.15) holds.


Problems 


P4.4.1 Show that if all the 1-by-1 and 2-by-2 principal submatrices of an n-by-n 
symmetric matrix A are singular, then A is zero. 
P4.4.2 Show that no 2-by-2 pivots can arise in the Bunch-Kaufman algorithm if A is 
positive definite. 
P4.4.3 Arrange Algorithm 4.4.1 so that only the lower triangular portion of A is
referenced and so that α(j) overwrites A(j, j) for j = 1:n, β(j) overwrites A(j+1, j) for
j = 1:n-1, and L(i, j) overwrites A(i, j-1) for j = 2:n-1 and i = j+1:n.
P4.4.4 Suppose $A \in \mathbb{R}^{n \times n}$ is nonsingular, symmetric, and strictly diagonally dominant.
Give an algorithm that computes the factorization

$$\Pi A \Pi^T = \begin{bmatrix} R & 0 \\ S & -M \end{bmatrix} \begin{bmatrix} R^T & S^T \\ 0 & M^T \end{bmatrix}$$

where $R \in \mathbb{R}^{k \times k}$ and $M \in \mathbb{R}^{(n-k) \times (n-k)}$ are lower triangular and nonsingular and Π is
a permutation.
P4.4.5 Show that if

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & -A_{22} \end{bmatrix}$$

is symmetric with $A_{11} \in \mathbb{R}^{n \times n}$ and $A_{22} \in \mathbb{R}^{p \times p}$ positive definite, then it has an $LDL^T$ factorization with
the property that

$$D = \begin{bmatrix} D_1 & 0 \\ 0 & -D_2 \end{bmatrix}$$

where $D_1 \in \mathbb{R}^{n \times n}$ and $D_2 \in \mathbb{R}^{p \times p}$ have positive diagonal entries.
P4.4.6 Prove (4.4.11) and (4.4.12).
P4.4.7 Show that $-(B^TC^{-1}B)^{-1}$ is the (2,2) block of $A^{-1}$ where A is given by (4.4.13).
P4.4.8 The point of this problem is to consider a special case of (4.4.15). Define the
matrix

$M(\alpha) = (B^TC^{-1}B)^{-1}B^TC^{-1}$

where

$C = (I_n + \alpha e_k e_k^T), \quad \alpha > -1$

and $e_k = I_n(:,k)$. (Note that C is just the identity with α added to the (k, k) entry.)
Assume that $B \in \mathbb{R}^{n \times p}$ has rank p and show that


$M(\alpha) = (B^TB)^{-1}B^T \left( I_n - \dfrac{\alpha}{1 + \alpha w^Tw}\, e_k w^T \right)$


where $w = (I_n - B(B^TB)^{-1}B^T)e_k$. Show that if $\| w \|_2 = 0$ or $\| w \|_2 = 1$, then
$\| M(\alpha) \|_2 = 1/\sigma_{\min}(B)$. Show that if $0 < \| w \|_2 < 1$, then


$\| M(\alpha) \|_2 \le \max\left\{ \dfrac{1}{1 - \| w \|_2},\; 1 + \dfrac{1}{\| w \|_2} \right\} \Big/ \sigma_{\min}(B)$.


Thus, $\| M(\alpha) \|_2$ has an α-independent upper bound.


The equilibrium system literature is scattered among the several application areas where 
it has an important role to play. Nice overviews with pointers to this literature include 


G. Strang (1988). “A Framework for Equilibrium Equations,” SIAM Review 30, 283-297. 
S.A. Vavasis (1994). “Stable Numerical Algorithms for Equilibrium Systems,” SIAM J.
Matrix Anal. Appl. 15, 1108-1131.


Other papers include


C.C. Paige (1979). “Fast Numerically Stable Computations for Generalized Linear Least
Squares Problems,” SIAM J. Num. Anal. 15, 165-71.

A. Björck and I.S. Duff (1980). “A Direct Method for the Solution of Sparse Linear
Least Squares Problems,” Lin. Alg. and Its Applic. 34, 43-67.

A. Björck (1992). “Pivoting and Stability in the Augmented System Method,” Proceedings
of the 14th Dundee Conference, D.F. Griffiths and G.A. Watson (eds), Longman
Scientific and Technical, Essex, U.K.

P.D. Hough and S.A. Vavasis (1996). “Complete Orthogonal Decomposition for Weighted
Least Squares,” SIAM J. Matrix Anal. Appl., to appear.


Some of these papers make use of the QR factorization and other least squares ideas
that are discussed in the next chapter and §12.1.

Problems with structure abound in matrix computations and perturbation theory
has a key role to play in the search for stable, efficient algorithms. For equilibrium systems,
there are several results like (4.4.15) that underpin the most effective algorithms.
See

A. Forsgren (1995). “On Linear Least-Squares Problems with Diagonally Dominant
Weight Matrices,” Technical Report TRITA-MAT-1995-OS2, Department of Mathematics,
Royal Institute of Technology, S-100 44, Stockholm, Sweden.


and the included references. A discussion of (4.4.15) may be found in


G.W. Stewart (1989). “On Scaled Projections and Pseudoinverses,” Lin. Alg. and Its
Applic. 118, 189-193.

D.P. O'Leary (1990). “On Bounds for Scaled Projections and Pseudoinverses,” Lin. Alg.
and Its Applic. 152, 115-117.

M.J. Todd (1990). “A Dantzig-Wolfe-like Variant of Karmarkar’s Interior-Point Linear
Programming Algorithm,” Operations Research 58, 1006-1018.


4.5 Block Systems 


In many application areas the matrices that arise have exploitable block 
structure. As a case study we have chosen to discuss block tridiagonal 
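systems of the form


$$\begin{bmatrix} D_1 & F_1 & \cdots & 0 \\ E_1 & D_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & F_{n-1} \\ 0 & \cdots & E_{n-1} & D_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$   (4.5.1)


Here we assume that all blocks are q-by-q and that the $x_i$ and $b_i$ are in
$\mathbb{R}^q$. In this section we discuss both a block LU approach to this problem as
well as a divide and conquer scheme known as cyclic reduction. Kronecker
product systems are briefly mentioned.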


4.5.1 Block Tridiagonal LU Factorization 


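We begin by considering a block LU factorization for the matrix in (4.5.1).
Define the block tridiagonal matrices $A_k$ by


$$A_k = \begin{bmatrix} D_1 & F_1 & \cdots & 0 \\ E_1 & D_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & F_{k-1} \\ 0 & \cdots & E_{k-1} & D_k \end{bmatrix}, \qquad k = 1:n.$$   (4.5.2)


Comparing blocks in


$$A_n = \begin{bmatrix} I & \cdots & & 0 \\ L_1 & I & & \vdots \\ \vdots & \ddots & \ddots & \\ 0 & \cdots & L_{n-1} & I \end{bmatrix} \begin{bmatrix} U_1 & F_1 & \cdots & 0 \\ 0 & U_2 & \ddots & \vdots \\ \vdots & & \ddots & F_{n-1} \\ 0 & \cdots & 0 & U_n \end{bmatrix}$$   (4.5.3)


we formally obtain the following algorithm for the $L_i$ and $U_i$:


    U_1 = D_1
    for i = 2:n
        Solve L_{i-1}U_{i-1} = E_{i-1} for L_{i-1}.          (4.5.4)
        U_i = D_i - L_{i-1}F_{i-1}
    end


The procedure is defined so long as the $U_i$ are nonsingular. This is assured,
for example, if the matrices $A_1, \ldots, A_n$ are nonsingular.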

Having computed the factorization (4.5.3), the vector x in (4.5.1) can
be obtained via block forward and back substitution:


    y_1 = b_1
    for i = 2:n
        y_i = b_i - L_{i-1}y_{i-1}
    end                                                    (4.5.5)
    Solve U_n x_n = y_n for x_n.
    for i = n-1:-1:1
        Solve U_i x_i = y_i - F_i x_{i+1} for x_i.
    end


To carry out both (4.5.4) and (4.5.5), each $U_i$ must be factored since linear
systems involving these submatrices are solved. This could be done using
Gaussian elimination with pivoting. However, this does not guarantee the
stability of the overall process. To see this just consider the case when the
block size q is unity.
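When q = 1 the recurrences (4.5.4) and (4.5.5) reduce to scalar tridiagonal elimination without pivoting. The following Python sketch shows that special case; the function name `tridiagonal_lu_solve` and the argument layout (diagonal, subdiagonal, superdiagonal as separate lists) are assumptions made for illustration, and exact rational arithmetic is used in the usage example.

```python
from fractions import Fraction

def tridiagonal_lu_solve(d, e, f, b):
    """Solve a tridiagonal system via (4.5.4)-(4.5.5) with 1-by-1 blocks.

    d: diagonal (length n); e: subdiagonal; f: superdiagonal (length n-1).
    No pivoting is performed, so a zero or tiny u[i] is not guarded against.
    """
    n = len(d)
    u, l = [d[0]], []
    for i in range(1, n):                 # (4.5.4): L_{i-1} U_{i-1} = E_{i-1}
        l.append(e[i - 1] / u[i - 1])
        u.append(d[i] - l[i - 1] * f[i - 1])
    y = [b[0]]
    for i in range(1, n):                 # forward substitution in (4.5.5)
        y.append(b[i] - l[i - 1] * y[i - 1])
    x = [None] * n
    x[n - 1] = y[n - 1] / u[n - 1]
    for i in range(n - 2, -1, -1):        # back substitution in (4.5.5)
        x[i] = (y[i] - f[i] * x[i + 1]) / u[i]
    return x

# A 7-by-7 test system tridiag(-1, 4, -1) x = b (data chosen for illustration).
diag = [Fraction(4)] * 7
off = [Fraction(-1)] * 6
rhs = [Fraction(v) for v in (2, 4, 6, 8, 10, 12, 22)]
print(tridiagonal_lu_solve(diag, off, off, rhs))
```

With d[0] = 0 the first division fails even for a nonsingular matrix such as [[0, 1], [1, 0]], which is the q = 1 instance of the stability caveat: pivoting inside each block does not help when the block itself is a poor pivot.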


4.5.2 Block Diagonal Dominance 


In order to obtain satisfactory bounds on the $L_i$ and $U_i$ it is necessary
to make additional assumptions about the underlying block matrix. For
example, if for i = 1:n we have the block diagonal dominance relations


$\| D_i^{-1} \|_1 \left( \| F_{i-1} \|_1 + \| E_i \|_1 \right) < 1, \qquad E_n \equiv F_0 \equiv 0$   (4.5.6)

then the factorization (4.5.3) exists and it is possible to show that the $L_i$
and $U_i$ satisfy the inequalities


$\| L_i \|_1 \le 1$   (4.5.7)
$\| U_i \|_1 \le \| A_n \|_1$   (4.5.8)


4.5.3 Block Versus Band Solving 


At this point it is reasonable to ask why we do not simply regard the matrix 
A in (4.5.1) as a qn-by-qn matrix having scalar entries and bandwidth 
2q — 1. Band Gaussian elimination as described in §4.3 could be applied. 
The effectiveness of this course of action depends on such things as the 
dimensions of the blocks and the sparsity patterns within each block. 


To illustrate this in a very simple setting, suppose that we wish to solve


$$\begin{bmatrix} D_1 & F_1 \\ E_1 & D_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$   (4.5.9)


where $D_1$ and $D_2$ are diagonal and $F_1$ and $E_1$ are tridiagonal. Assume
that each of these blocks is n-by-n and that it is "safe" to solve (4.5.9) via
(4.5.3) and (4.5.5). Note that


    U_1 = D_1                      (diagonal)
    L_1 = E_1 U_1^{-1}             (tridiagonal)
    U_2 = D_2 - L_1 F_1            (pentadiagonal)
    y_1 = b_1
    y_2 = b_2 - E_1(D_1^{-1} y_1)
    U_2 x_2 = y_2
    D_1 x_1 = y_1 - F_1 x_2


Consequently, some very simple n-by-n calculations with the original banded
blocks render the solution.

On the other hand, the naive application of band Gaussian elimination
to the system (4.5.9) would entail a great deal of unnecessary work and
storage as the system has bandwidth n + 1. However, we mention that by
permuting the rows and columns of the system via the permutation


$P = [\, e_1, e_{n+1}, e_2, \ldots, e_n, e_{2n} \,]$   (4.5.10)

we find (in the n = 5 case) that


             x x 0 x 0 0 0 0 0 0
             x x x 0 0 0 0 0 0 0
             0 x x x 0 x 0 0 0 0
             x 0 x x x 0 0 0 0 0
    PAP^T =  0 0 0 x x x 0 x 0 0
             0 0 x 0 x x x 0 0 0
             0 0 0 0 0 x x x 0 x
             0 0 0 0 x 0 x x x 0
             0 0 0 0 0 0 0 x x x
             0 0 0 0 0 0 x 0 x x


This matrix has upper and lower bandwidth equal to three and so a very
reasonable solution procedure results by applying band Gaussian elimination
to this permuted version of A.

The subject of bandwidth-reducing permutations is important. See
George and Liu (1981, Chapter 4). We also refer the reader to Varah
(1972) and George (1974) for further details concerning the solution of block
tridiagonal systems.


4.5.4 Block Cyclic Reduction 


We next describe the method of block cyclic reduction that can be used
to solve some important special instances of the block tridiagonal system
(4.5.1). For simplicity, we assume that A has the form


$$A = \begin{bmatrix} D & F & \cdots & 0 \\ F & D & \ddots & \vdots \\ \vdots & \ddots & \ddots & F \\ 0 & \cdots & F & D \end{bmatrix} \in \mathbb{R}^{nq \times nq}$$   (4.5.11)


where F and D are q-by-q matrices that satisfy DF = FD. We also assume
that $n = 2^k - 1$. These conditions hold in certain important applications
such as the discretization of Poisson's equation on a rectangle. In that
situation,


$$D = \begin{bmatrix} 4 & -1 & \cdots & 0 \\ -1 & 4 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -1 \\ 0 & \cdots & -1 & 4 \end{bmatrix}$$   (4.5.12)

and $F = -I_q$. The integer n is determined by the size of the mesh and can
often be chosen to be of the form $n = 2^k - 1$. (Sweet (1977) shows how to
proceed when the dimension is not of this form.)

The basic idea behind cyclic reduction is to halve the dimension of the
problem on hand repeatedly until we are left with a single q-by-q system
for the unknown subvector $x_{2^{k-1}}$. This system is then solved by standard
means. The previously eliminated $x_i$ are found by a back-substitution
process.

The general procedure is adequately motivated by considering the case
n = 7:

$$\begin{aligned}
b_1 &= Dx_1 + Fx_2 \\
b_2 &= Fx_1 + Dx_2 + Fx_3 \\
b_3 &= Fx_2 + Dx_3 + Fx_4 \\
b_4 &= Fx_3 + Dx_4 + Fx_5 \\
b_5 &= Fx_4 + Dx_5 + Fx_6 \\
b_6 &= Fx_5 + Dx_6 + Fx_7 \\
b_7 &= Fx_6 + Dx_7
\end{aligned}$$   (4.5.13)

For i = 2, 4, and 6 we multiply equations i-1, i, and i+1 by F, -D, and
F, respectively, and add the resulting equations to obtain

$$\begin{aligned}
(2F^2 - D^2)x_2 + F^2x_4 &= F(b_1 + b_3) - Db_2 \\
F^2x_2 + (2F^2 - D^2)x_4 + F^2x_6 &= F(b_3 + b_5) - Db_4 \\
F^2x_4 + (2F^2 - D^2)x_6 &= F(b_5 + b_7) - Db_6
\end{aligned}$$

Thus, with this tactic we have removed the odd-indexed $x_i$ and are left
with a reduced block tridiagonal system of the form


$$\begin{aligned}
D^{(1)}x_2 + F^{(1)}x_4 &= b_2^{(1)} \\
F^{(1)}x_2 + D^{(1)}x_4 + F^{(1)}x_6 &= b_4^{(1)} \\
F^{(1)}x_4 + D^{(1)}x_6 &= b_6^{(1)}
\end{aligned}$$


where $D^{(1)} = 2F^2 - D^2$ and $F^{(1)} = F^2$ commute. Applying the same elimination
strategy as above, we multiply these three equations respectively
by $F^{(1)}$, $-D^{(1)}$, and $F^{(1)}$. When these transformed equations are added
together, we obtain the single equation


$\left( 2[F^{(1)}]^2 - [D^{(1)}]^2 \right) x_4 = F^{(1)}\left( b_2^{(1)} + b_6^{(1)} \right) - D^{(1)}b_4^{(1)}$

which we write as

$D^{(2)}x_4 = b^{(2)}$.


This completes the cyclic reduction. We now solve this (small) q-by-q system
for $x_4$. The vectors $x_2$ and $x_6$ are then found by solving the systems


$D^{(1)}x_2 = b_2^{(1)} - F^{(1)}x_4$
$D^{(1)}x_6 = b_6^{(1)} - F^{(1)}x_4$

Finally, we use the first, third, fifth, and seventh equations in (4.5.13) to
compute $x_1$, $x_3$, $x_5$, and $x_7$, respectively.

For general n of the form $n = 2^k - 1$ we set $D^{(0)} = D$, $F^{(0)} = F$, $b^{(0)} = b$
and compute:


    for p = 1:k-1
        F^{(p)} = [F^{(p-1)}]^2
        D^{(p)} = 2F^{(p)} - [D^{(p-1)}]^2
        r = 2^p
        for j = 1:2^{k-p} - 1                                       (4.5.14)
            b_{jr}^{(p)} = F^{(p-1)}(b_{jr-r/2}^{(p-1)} + b_{jr+r/2}^{(p-1)}) - D^{(p-1)} b_{jr}^{(p-1)}
        end
    end


The $x_i$ are then computed as follows:


    Solve D^{(k-1)} x_{2^{k-1}} = b_{2^{k-1}}^{(k-1)} for x_{2^{k-1}}.
    for p = k-2:-1:0
        r = 2^p
        for j = 1:2^{k-p-1}                                         (4.5.15)
            if j = 1
                c = b_{(2j-1)r}^{(p)} - F^{(p)} x_{2jr}
            elseif j = 2^{k-p-1}
                c = b_{(2j-1)r}^{(p)} - F^{(p)} x_{(2j-2)r}
            else
                c = b_{(2j-1)r}^{(p)} - F^{(p)} (x_{2jr} + x_{(2j-2)r})
            end
            Solve D^{(p)} x_{(2j-1)r} = c for x_{(2j-1)r}
        end
    end


The amount of work required to perform these recursions depends greatly
upon the sparsity of the $D^{(p)}$ and $F^{(p)}$. In the worst case when these
matrices are full, the overall flop count has order $\log(n)\,q^3$. Care must be
exercised in order to ensure stability during the reduction. For further
details, see Buneman (1969).
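The two recursions can be traced in the scalar case q = 1, where D and F are numbers and commute trivially. The Python sketch below is an illustration under stated assumptions: the name `cyclic_reduction_scalar` is hypothetical, exact rational arithmetic is used, and the three-way branch in (4.5.15) is replaced by zero padding (x_0 = x_{n+1} = 0), which gives the same values of c.

```python
from fractions import Fraction

def cyclic_reduction_scalar(D, F, b):
    """Solve tridiag(F, D, F) x = b for n = 2**k - 1 unknowns (q = 1 case)."""
    n = len(b)
    k = (n + 1).bit_length() - 1
    assert 2 ** k - 1 == n, "n must have the form 2**k - 1"
    Ds, Fs = [Fraction(D)], [Fraction(F)]
    bs = [[None] + [Fraction(v) for v in b]]     # bs[p][i] holds b_i^(p), 1-based
    for p in range(1, k):                         # reduction phase, as in (4.5.14)
        r = 2 ** p
        Fs.append(Fs[p - 1] ** 2)
        Ds.append(2 * Fs[p] - Ds[p - 1] ** 2)
        new = [None] * (n + 1)
        for j in range(1, 2 ** (k - p)):
            i = j * r
            new[i] = (Fs[p - 1] * (bs[p - 1][i - r // 2] + bs[p - 1][i + r // 2])
                      - Ds[p - 1] * bs[p - 1][i])
        bs.append(new)
    x = [Fraction(0)] * (n + 2)                   # x[0] and x[n+1] stay zero
    mid = 2 ** (k - 1)
    x[mid] = bs[k - 1][mid] / Ds[k - 1]
    for p in range(k - 2, -1, -1):                # back substitution, as in (4.5.15)
        r = 2 ** p
        for j in range(1, 2 ** (k - p - 1) + 1):
            i = (2 * j - 1) * r
            c = bs[p][i] - Fs[p] * (x[i - r] + x[i + r])
            x[i] = c / Ds[p]
    return x[1:n + 1]

# D = 4, F = -1, n = 7 (so k = 3), with an illustrative right-hand side.
print(cyclic_reduction_scalar(4, -1, [2, 4, 6, 8, 10, 12, 22]))
```

For this input the intermediate quantities are $D^{(1)} = -14$, $F^{(1)} = 1$, and $D^{(2)} = -194$. In floating point the right-hand side accumulation would need the more careful treatment mentioned above; the exact arithmetic here sidesteps that issue.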


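Example 4.5.1 Suppose q = 1, D = (4), and F = (-1) in (4.5.13) and that we wish to
solve:


$$\begin{bmatrix} 4 & -1 & 0 & 0 & 0 & 0 & 0 \\ -1 & 4 & -1 & 0 & 0 & 0 & 0 \\ 0 & -1 & 4 & -1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 4 & -1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 4 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 & 4 & -1 \\ 0 & 0 & 0 & 0 & 0 & -1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \\ 10 \\ 12 \\ 22 \end{bmatrix}$$

By executing (4.5.14) we obtain the reduced systems:

$$\begin{bmatrix} -14 & 1 & 0 \\ 1 & -14 & 1 \\ 0 & 1 & -14 \end{bmatrix} \begin{bmatrix} x_2 \\ x_4 \\ x_6 \end{bmatrix} = \begin{bmatrix} -24 \\ -48 \\ -80 \end{bmatrix} \qquad (p = 1)$$

and

$$[\, -194 \,]\,[\, x_4 \,] = [\, -776 \,] \qquad (p = 2).$$

The $x_i$ are then determined via (4.5.15):


    p = 2:  x_4 = 4
    p = 1:  x_2 = 2, x_6 = 6
    p = 0:  x_1 = 1, x_3 = 3, x_5 = 5, x_7 = 7


Cyclic reduction is an example of a divide and conquer algorithm. Other
divide and conquer procedures are discussed in §1.3.8 and §8.6.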


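4.5.5 Kronecker Product Systems

If $B \in \mathbb{R}^{m \times n}$ and $C \in \mathbb{R}^{p \times q}$, then their Kronecker product is given by


$$A = B \otimes C = \begin{bmatrix} b_{11}C & b_{12}C & \cdots & b_{1n}C \\ b_{21}C & b_{22}C & \cdots & b_{2n}C \\ \vdots & \vdots & & \vdots \\ b_{m1}C & b_{m2}C & \cdots & b_{mn}C \end{bmatrix}$$

Thus, A is an m-by-n block matrix whose (i, j) block is $b_{ij}C$. Kronecker
products arise in conjunction with various mesh discretizations and throughout
signal processing. Some of the more important properties that the
Kronecker product satisfies include


$(A \otimes B)(C \otimes D) = AC \otimes BD$   (4.5.16)
$(A \otimes B)^T = A^T \otimes B^T$   (4.5.17)
$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$   (4.5.18)


where it is assumed that all the factor operations are defined.
Related to the Kronecker product is the “vec” operation:


$$X \in \mathbb{R}^{m \times n} \;\Leftrightarrow\; \mathrm{vec}(X) = \begin{bmatrix} X(:,1) \\ \vdots \\ X(:,n) \end{bmatrix} \in \mathbb{R}^{mn}.$$


Thus, the vec of a matrix amounts to a “stacking” of its columns. It can
be shown that


$Y = CXB^T \;\Leftrightarrow\; \mathrm{vec}(Y) = (B \otimes C)\,\mathrm{vec}(X)$.   (4.5.19)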
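The identity (4.5.19) is easy to test on small integer matrices. The following Python sketch is illustrative: the helper names `kron`, `vec`, `matmul`, and `transpose` are introduced here and operate on plain nested lists.

```python
def kron(B, C):
    """Kronecker product: block (i, j) of the result is B[i][j] * C."""
    p, q = len(C), len(C[0])
    return [[B[i // p][j // q] * C[i % p][j % q]
             for j in range(len(B[0]) * q)]
            for i in range(len(B) * p)]

def vec(X):
    """Stack the columns of X into a single list."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def matmul(A, B):
    """Ordinary matrix product of nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Illustrative data (an assumption, chosen only to be small).
B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 1]]
X = [[1, 2], [3, 4]]

Y = matmul(matmul(C, X), transpose(B))                  # Y = C X B^T
lhs = vec(Y)
rhs = [sum(a * v for a, v in zip(row, vec(X))) for row in kron(B, C)]
assert lhs == rhs
```

Note the ordering: the matrix that multiplies X on the right (through its transpose) appears first in the Kronecker product.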

It follows that solving a Kronecker product system,

$(B \otimes C)x = d$


is equivalent to solving the matrix equation $CXB^T = D$ for X where
x = vec(X) and d = vec(D). This has efficiency ramifications. To illustrate,
suppose $B, C \in \mathbb{R}^{n \times n}$ are symmetric positive definite. If $A = B \otimes C$ is
treated as a general matrix and factored in order to solve for x, then $O(n^6)$
flops are required since $B \otimes C \in \mathbb{R}^{n^2 \times n^2}$. On the other hand, the solution
approach


1. Compute the Cholesky factorizations $B = GG^T$ and $C = HH^T$.
2. Solve $BZ = D^T$ for Z using G.
3. Solve $CX = Z^T$ for X using H.
4. x = vec(X).

involves $O(n^3)$ flops. Note that

$B \otimes C = GG^T \otimes HH^T = (G \otimes H)(G \otimes H)^T$


is the Cholesky factorization of $B \otimes C$ because the Kronecker product of a
pair of lower triangular matrices is lower triangular. Thus, the above four-step
solution approach is a structure-exploiting Cholesky method applied
to $B \otimes C$.
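The four-step approach can be sketched directly. The Python code below is a minimal floating point illustration, not a library routine: the names `cholesky`, `chol_solve`, and `kron_solve` are hypothetical, matrices are nested lists, and no check for positive definiteness is made.

```python
import math

def cholesky(A):
    """Return lower triangular G with A = G G^T (A symmetric positive definite)."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        G[j][j] = math.sqrt(A[j][j] - sum(G[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            G[i][j] = (A[i][j] - sum(G[i][k] * G[j][k] for k in range(j))) / G[j][j]
    return G

def chol_solve(G, R):
    """Solve (G G^T) Z = R one column at a time."""
    n, m = len(G), len(R[0])
    Z = [[0.0] * m for _ in range(n)]
    for c in range(m):
        y = [0.0] * n
        for i in range(n):                      # forward solve G y = R(:, c)
            y[i] = (R[i][c] - sum(G[i][k] * y[k] for k in range(i))) / G[i][i]
        for i in range(n - 1, -1, -1):          # back solve G^T z = y
            Z[i][c] = (y[i] - sum(G[k][i] * Z[k][c] for k in range(i + 1, n))) / G[i][i]
    return Z

def kron_solve(B, C, D):
    """Solve C X B^T = D, equivalently (B kron C) vec(X) = vec(D)."""
    t = lambda M: [list(col) for col in zip(*M)]
    Z = chol_solve(cholesky(B), t(D))           # step 2: B Z = D^T
    return chol_solve(cholesky(C), t(Z))        # step 3: C X = Z^T

# Illustrative data: D was formed as C X B^T with X = [[1, 2], [3, 4]].
B = [[2.0, 1.0], [1.0, 2.0]]
C = [[4.0, 1.0], [1.0, 3.0]]
D = [[26.0, 31.0], [34.0, 38.0]]
X = kron_solve(B, C, D)
```

Only n-by-n factorizations and triangular solves appear, and the $n^2$-by-$n^2$ matrix $B \otimes C$ is never formed.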

We mention that if B is sparse, then $B \otimes C$ has the same sparsity at the
block level. For example, if B is tridiagonal, then $B \otimes C$ is block tridiagonal.


Problems


P4.5.1 Show that a block diagonally dominant matrix is nonsingular.

P4.5.2 Verify that (4.5.6) implies (4.5.7) and (4.5.8).

P4.5.3 Suppose block cyclic reduction is applied with D given by (4.5.12) and $F = -I_q$.
What can you say about the band structure of the matrices $F^{(p)}$ and $D^{(p)}$ that arise?


P4.5.4 Suppose $A \in \mathbb{R}^{n \times n}$ is nonsingular and that we have solutions to the linear
systems Ax = b and Ay = g where $b, g \in \mathbb{R}^n$ are given. Show how to solve the system


$$\begin{bmatrix} A & g \\ h^T & \alpha \end{bmatrix} \begin{bmatrix} x \\ \mu \end{bmatrix} = \begin{bmatrix} b \\ \beta \end{bmatrix}$$


in O(n) flops where $\alpha, \beta \in \mathbb{R}$ and $h \in \mathbb{R}^n$ are given and the matrix of coefficients $A_+$ is
nonsingular. The advisability of going for such a quick solution is a complicated issue
that depends upon the condition numbers of A and $A_+$ and other factors.


P4.5.5 Verify (4.5.16)-(4.5.19).
P4.5.7 Show how to construct the SVD of $B \otimes C$ from the SVDs of B and C.
P4.5.8 If A, B, and C are matrices, then it can be shown that $(A \otimes B) \otimes C = A \otimes (B \otimes C)$
and so we just write $A \otimes B \otimes C$ for this matrix. Show how to solve the linear system
$(A \otimes B \otimes C)x = d$ assuming that A, B, and C are symmetric positive definite.


Notes and References for Sec. 4.5 


The following papers provide insight into the various nuances of block matrix computations:


J.M. Varah (1972). “On the Solution of Block-Tridiagonal Systems Arising from Certain
Finite-Difference Equations,” Math. Comp. 26, 859-68.

J.A. George (1974). “On Block Elimination for Sparse Linear Systems,” SIAM J. Num.
Anal. 11, 585-603.

R. Fourer (1984). “Staircase Matrices and Systems,” SIAM Review 26, 1-71.

M.L. Merriam (1985). “On the Factorization of Block Tridiagonals With Storage Constraints,”
SIAM J. Sci. and Stat. Comp. 5, 182-192.


The property of block diagonal dominance and its various implications is the central
theme in


D.G. Feingold and R.S. Varga (1962). “Block Diagonally Dominant Matrices and Generalizations
of the Gershgorin Circle Theorem,” Pacific J. Math. 12, 1241-50.


Early methods that involve the idea of cyclic reduction are described in


R.W. Hockney (1965). “A Fast Direct Solution of Poisson's Equation Using Fourier
Analysis,” J. ACM 12, 95-113.

B.L. Buzbee, G.H. Golub, and C.W. Nielson (1970). “On Direct Methods for Solving
Poisson's Equations,” SIAM J. Num. Anal. 7, 627-56.


The accumulation of the right-hand side must be done with great care, for otherwise
there would be a significant loss of accuracy. A stable way of doing this is described in


O. Buneman (1969). “A Compact Non-Iterative Poisson Solver,” Report 294, Stanford
University Institute for Plasma Research, Stanford, California.


Other literature concerned with cyclic reduction includes


F.W. Dorr (1970). “The Direct Solution of the Discrete Poisson Equation on a Rectangle,”
SIAM Review 12, 248-63.

B.L. Buzbee, F.W. Dorr, J.A. George, and G.H. Golub (1971). “The Direct Solution of
the Discrete Poisson Equation on Irregular Regions,” SIAM J. Num. Anal. 8, 722-38.

F.W. Dorr (1973). “The Direct Solution of the Discrete Poisson Equation in O(n^2)
Operations,” SIAM Review 15, 412-415.

P. Concus and G.H. Golub (1973). “Use of Fast Direct Methods for the Efficient Numerical
Solution of Nonseparable Elliptic Equations,” SIAM J. Num. Anal. 10,
1103-20.

B.L. Buzbee and F.W. Dorr (1974). “The Direct Solution of the Biharmonic Equation
on Rectangular Regions and the Poisson Equation on Irregular Regions,” SIAM J.
Num. Anal. 11, 753-63.

D. Heller (1976). “Some Aspects of the Cyclic Reduction Algorithm for Block Tridiagonal
Linear Systems,” SIAM J. Num. Anal. 13, 484-96.


Various generalizations and extensions to cyclic reduction have been proposed:


P.N. Swarztrauber and R.A. Sweet (1973). “The Direct Solution of the Discrete Poisson
Equation on a Disk,” SIAM J. Num. Anal. 10, 900-907.

R.A. Sweet (1974). “A Generalized Cyclic Reduction Algorithm,” SIAM J. Num. Anal.
11, 506-20.

M.A. Diamond and D.L.V. Ferreira (1976). “On a Cyclic Reduction Method for the
Solution of Poisson's Equation,” SIAM J. Num. Anal. 13, 54-70.

R.A. Sweet (1977). “A Cyclic Reduction Algorithm for Solving Block Tridiagonal Systems
of Arbitrary Dimension,” SIAM J. Num. Anal. 14, 706-20.

P.N. Swarztrauber and R. Sweet (1989). “Vector and Parallel Methods for the Direct
Solution of Poisson's Equation,” J. Comp. Appl. Math. 27, 241-263.

S. Bondeli and W. Gander (1994). “Cyclic Reduction for Special Tridiagonal Systems,”
SIAM J. Matrix Anal. Appl. 15, 321-330.


For certain matrices that arise in conjunction with elliptic partial differential equations, 
block elimination corresponds to rather natural operations on the underlying mesh. A 
classical example of this is the method of nested dissection described in 


A. George (1973). “Nested Dissection of a Regular Finite Element Mesh,” SIAM J.
Num. Anal. 10, 345-63.


We also mention the following general survey:


J.R. Bunch (1976). “Block Methods for Solving Sparse Linear Systems,” in Sparse
Matrix Computations, J.R. Bunch and D.J. Rose (eds), Academic Press, New York.


Bordered linear systems as presented in P4.5.4 are discussed in


W. Govaerts and J.D. Pryce (1990). “Block Elimination with One Iterative Refinement
Solves Bordered Linear Systems Accurately,” BIT 30, 490-507.

W. Govaerts (1991). “Stable Solvers and Block Elimination for Bordered Systems,”
SIAM J. Matrix Anal. Appl. 12, 460-483.

W. Govaerts and J.D. Pryce (1993). “Mixed Block Elimination for Linear Systems with
Wider Borders,” IMA J. Num. Anal. 13, 161-180.


Kronecker product references include


H.C. Andrews and J. Kane (1970). “Kronecker Matrices, Computer Implementation,
and Generalized Spectra,” J. Assoc. Comput. Mach. 17, 260-268.

C. de Boor (1979). “Efficient Computer Manipulation of Tensor Products,” ACM Trans.
Math. Soft. 5, 173-182.

A. Graham (1981). Kronecker Products and Matrix Calculus with Applications, Ellis
Horwood Ltd., Chichester, England.

H.V. Henderson and S.R. Searle (1981). “The Vec-Permutation Matrix, The Vec Operator,
and Kronecker Products: A Review,” Linear and Multilinear Algebra 9, 271-288.

P.A. Regalia and S. Mitra (1989). “Kronecker Products, Unitary Matrices, and Signal
Processing Applications,” SIAM Review 31, 586-613.


4.6 Vandermonde Systems and the FFT 


Suppose $x(0:n) \in \mathbb{R}^{n+1}$. A matrix $V \in \mathbb{R}^{(n+1) \times (n+1)}$ of the form


$$V = V(x_0, \ldots, x_n) = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{bmatrix}$$

is said to be a Vandermonde matrix. In this section, we show how the
systems $V^Ta = f = f(0:n)$ and $Vz = b = b(0:n)$ can be solved in $O(n^2)$
flops. The discrete Fourier transform is briefly introduced. This special and
extremely important Vandermonde system has a recursive block structure
and can be solved in $O(n \log n)$ flops. In this section, vectors and matrices
are subscripted from 0.


4.6.1 Polynomial Interpolation: $V^Ta = f$


Vandermonde systems arise in many approximation and interpolation problems.
Indeed, the key to obtaining a fast Vandermonde solver is to recognize
that solving $V^Ta = f$ is equivalent to polynomial interpolation. This follows
because if $V^Ta = f$ and


$p(x) = \sum_{j=0}^{n} a_j x^j$   (4.6.1)


then $p(x_i) = f_i$ for i = 0:n.

Recall that if the $x_i$ are distinct then there is a unique polynomial of
degree n that interpolates $(x_0, f_0), \ldots, (x_n, f_n)$. Consequently, V is nonsingular
as long as the $x_i$ are distinct. We assume this throughout the
section.

The first step in computing the $a_j$ of (4.6.1) is to calculate the Newton
representation of the interpolating polynomial p:

\[ p(x) = \sum_{k=0}^{n} c_k \left( \prod_{i=0}^{k-1} (x - x_i) \right). \tag{4.6.2} \]

The constants $c_k$ are divided differences and may be determined as follows:

    c(0:n) = f(0:n)
    for k = 0:n-1
        for i = n:-1:k+1                                  (4.6.3)
            c(i) = (c(i) - c(i-1))/(x(i) - x(i-k-1))
        end
    end

See Conte and de Boor (1980, chapter 2).
The next task is to generate a(0:n) from c(0:n). Define the polynomials
$p_n(x), \ldots, p_0(x)$ by the iteration

    $p_n(x) = c_n$
    for k = n-1:-1:0
        $p_k(x) = c_k + (x - x_k) p_{k+1}(x)$
    end

and observe that $p_0(x) = p(x)$. Writing

\[ p_k(x) = a_k^{(k)} + a_{k+1}^{(k)} x + \cdots + a_n^{(k)} x^{n-k} \]

and equating like powers of x in the equation $p_k = c_k + (x - x_k) p_{k+1}$
gives the following recursion for the coefficients $a_i^{(k)}$:

    $a_n^{(n)} = c_n$
    for k = n-1:-1:0
        $a_k^{(k)} = c_k - x_k a_{k+1}^{(k+1)}$
        for i = k+1:n-1
            $a_i^{(k)} = a_i^{(k+1)} - x_k a_{i+1}^{(k+1)}$
        end
        $a_n^{(k)} = a_n^{(k+1)}$
    end

Consequently, the coefficients $a_i = a_i^{(0)}$ can be calculated as follows:

    a(0:n) = c(0:n)
    for k = n-1:-1:0
        for i = k:n-1                                     (4.6.4)
            a(i) = a(i) - x(k)a(i+1)
        end
    end


Combining this iteration with (4.6.3) renders the following algorithm:


Algorithm 4.6.1 Given $x(0:n) \in \mathbb{R}^{n+1}$ with distinct entries and
$f = f(0:n) \in \mathbb{R}^{n+1}$, the following algorithm overwrites f with
the solution a = a(0:n) to the Vandermonde system
$V(x_0,\ldots,x_n)^T a = f$.

    for k = 0:n-1
        for i = n:-1:k+1
            f(i) = (f(i) - f(i-1))/(x(i) - x(i-k-1))
        end
    end
    for k = n-1:-1:0
        for i = k:n-1
            f(i) = f(i) - f(i+1)x(k)
        end
    end


This algorithm requires $5n^2/2$ flops.
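
The two loops of Algorithm 4.6.1 can be sketched in Python as follows. This is only an illustration: the function name `vandermonde_transpose_solve` is an assumption of this sketch, a copy of f is updated instead of overwriting the input, and exact rational arithmetic (`fractions.Fraction`) is used so that the result can be compared with Example 4.6.1 without rounding.

```python
from fractions import Fraction


def vandermonde_transpose_solve(x, f):
    """Return a such that sum_j a[j] * x[i]**j == f[i] for all i.

    Hypothetical helper following Algorithm 4.6.1; x must have distinct
    entries and indices run from 0 to n as in the text.
    """
    n = len(x) - 1
    a = list(f)  # work on a copy rather than overwriting f
    # First k-loop: divided differences (Newton coefficients c).
    for k in range(n):
        for i in range(n, k, -1):          # i = n:-1:k+1
            a[i] = (a[i] - a[i - 1]) / (x[i] - x[i - k - 1])
    # Second k-loop: Newton coefficients -> monomial coefficients.
    for k in range(n - 1, -1, -1):         # k = n-1:-1:0
        for i in range(k, n):              # i = k:n-1
            a[i] = a[i] - a[i + 1] * x[k]
    return a


# Data of Example 4.6.1: nodes 1, 2, 3, 4 and p(x) = 4 + 3x + 2x^2 + x^3.
x = [Fraction(v) for v in (1, 2, 3, 4)]
f = [Fraction(v) for v in (10, 26, 58, 112)]
a = vandermonde_transpose_solve(x, f)      # equals [4, 3, 2, 1]
```

With floating point inputs the same code runs unchanged; only the exact comparison would then need a tolerance.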


Example 4.6.1 Suppose Algorithm 4.6.1 is used to solve

\[
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
1 & 3 & 9 & 27 \\
1 & 4 & 16 & 64
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
=
\begin{bmatrix} 10 \\ 26 \\ 58 \\ 112 \end{bmatrix}.
\]

The first k-loop computes the Newton representation of p(x):

\[ p(x) = 10 + 16(x-1) + 8(x-1)(x-2) + (x-1)(x-2)(x-3). \]

The second k-loop computes $a = [4\ 3\ 2\ 1]^T$ from $[10\ 16\ 8\ 1]^T$.


4.6.2 The System $Vz = b$


Now consider the system $Vz = b$. To derive an efficient algorithm for this
problem, we describe what Algorithm 4.6.1 does in matrix-vector language.
Define the lower bidiagonal matrix
$L_k(\alpha) \in \mathbb{R}^{(n+1)\times(n+1)}$ by

\[
L_k(\alpha) =
\begin{bmatrix}
I_k & 0 \\
0 & \begin{bmatrix}
1 & & & \\
-\alpha & 1 & & \\
 & \ddots & \ddots & \\
 & & -\alpha & 1
\end{bmatrix}
\end{bmatrix}
\]

and the diagonal matrix $D_k$ by

\[ D_k = \mathrm{diag}(\underbrace{1,\ldots,1}_{k+1},\ x_{k+1} - x_0,\ \ldots,\ x_n - x_{n-k-1}). \]

With these definitions it is easy to verify from (4.6.3) that if f = f(0:n)
and c = c(0:n) is the vector of divided differences then $c = U^T f$ where U
is the upper triangular matrix defined by

\[ U^T = D_{n-1}^{-1} L_{n-1}(1) \cdots D_0^{-1} L_0(1). \]

Similarly, from (4.6.4) we have

\[ a = L^T c, \]

where L is the unit lower triangular matrix defined by

\[ L^T = L_0(x_0)^T \cdots L_{n-1}(x_{n-1})^T. \]

Thus, $a = L^T U^T f$ where $V^{-T} = L^T U^T$. In other words, Algorithm 4.6.1
solves $V^T a = f$ by tacitly computing the "UL" factorization of $V^{-1}$.

Consequently, the solution to the system $Vz = b$ is given by

\[
z = V^{-1} b = U(Lb)
  = \left( L_0(1)^T D_0^{-1} \cdots L_{n-1}(1)^T D_{n-1}^{-1} \right)
    \left( L_{n-1}(x_{n-1}) \cdots L_0(x_0)\, b \right).
\]

This observation gives rise to the following algorithm:


Algorithm 4.6.2 Given $x(0:n) \in \mathbb{R}^{n+1}$ with distinct entries and
$b = b(0:n) \in \mathbb{R}^{n+1}$, the following algorithm overwrites b with
the solution z = z(0:n) to the Vandermonde system $V(x_0,\ldots,x_n) z = b$.

    for k = 0:n-1
        for i = n:-1:k+1
            b(i) = b(i) - x(k)b(i-1)
        end
    end
    for k = n-1:-1:0
        for i = k+1:n
            b(i) = b(i)/(x(i) - x(i-k-1))
        end
        for i = k:n-1
            b(i) = b(i) - b(i+1)
        end
    end


This algorithm requires $5n^2/2$ flops.
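
A Python rendering of Algorithm 4.6.2 might look like the sketch below. The function name `vandermonde_solve` is hypothetical, the right-hand side is copied rather than overwritten, and `fractions.Fraction` is used only so that the data of Example 4.6.2 can be reproduced exactly.

```python
from fractions import Fraction


def vandermonde_solve(x, b):
    """Return z with V(x_0,...,x_n) z = b (sketch of Algorithm 4.6.2)."""
    n = len(x) - 1
    z = list(b)  # work on a copy rather than overwriting b
    # First k-loop: apply L_{n-1}(x_{n-1}) ... L_0(x_0) to b.
    for k in range(n):
        for i in range(n, k, -1):          # i = n:-1:k+1
            z[i] = z[i] - x[k] * z[i - 1]
    # Second k-loop: apply D_k^{-1} and then L_k(1)^T for k = n-1, ..., 0.
    for k in range(n - 1, -1, -1):
        for i in range(k + 1, n + 1):      # i = k+1:n
            z[i] = z[i] / (x[i] - x[i - k - 1])
        for i in range(k, n):              # i = k:n-1
            z[i] = z[i] - z[i + 1]
    return z


# Data of Example 4.6.2: nodes 1, 2, 3, 4 and b = [0, -1, 3, 35].
x = [Fraction(v) for v in (1, 2, 3, 4)]
b = [Fraction(v) for v in (0, -1, 3, 35)]
z = vandermonde_solve(x, b)                # equals [3, -4, 0, 1]
```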


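Example 4.6.2 Suppose Algorithm 4.6.2 is used to solve

\[
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 2 & 3 & 4 \\
1 & 4 & 9 & 16 \\
1 & 8 & 27 & 64
\end{bmatrix}
\begin{bmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -1 \\ 3 \\ 35 \end{bmatrix}.
\]

The first k-loop computes the vector

\[
L_2(3) L_1(2) L_0(1)
\begin{bmatrix} 0 \\ -1 \\ 3 \\ 35 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -1 \\ 6 \\ 6 \end{bmatrix}.
\]

The second k-loop then calculates

\[
L_0(1)^T D_0^{-1} L_1(1)^T D_1^{-1} L_2(1)^T D_2^{-1}
\begin{bmatrix} 0 \\ -1 \\ 6 \\ 6 \end{bmatrix}
=
\begin{bmatrix} 3 \\ -4 \\ 0 \\ 1 \end{bmatrix}.
\]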


4.6.3 Stability


Algorithms 4.6.1 and 4.6.2 are discussed and analyzed in Björck and Pereyra
(1970). Their experience is that these algorithms frequently produce
surprisingly accurate solutions, even when V is ill-conditioned. They also
show how to update the solution when a new coordinate pair
$(x_{n+1}, f_{n+1})$ is added to the set of points to be interpolated, and how
to solve confluent Vandermonde systems, i.e., systems involving matrices like

\[
V = V(x_0, x_1, x_1, x_3) =
\begin{bmatrix}
1 & 1 & 0 & 1 \\
x_0 & x_1 & 1 & x_3 \\
x_0^2 & x_1^2 & 2x_1 & x_3^2 \\
x_0^3 & x_1^3 & 3x_1^2 & x_3^3
\end{bmatrix}.
\]

4.6.4 The Fast Fourier Transform 
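The discrete Fourier transform (DFT) matrix of order n is defined by

\[ F_n = (f_{jk}), \qquad f_{jk} = \omega_n^{jk}, \]

where

\[ \omega_n = \exp(-2\pi i/n) = \cos(2\pi/n) - i \sin(2\pi/n). \]

The parameter $\omega_n$ is an nth root of unity because $\omega_n^n = 1$. In
the n = 4 case, $\omega_4 = -i$ and

\[
F_4 =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & \omega_4 & \omega_4^2 & \omega_4^3 \\
1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\
1 & \omega_4^3 & \omega_4^6 & \omega_4^9
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -i & -1 & i \\
1 & -1 & 1 & -1 \\
1 & i & -1 & -i
\end{bmatrix}.
\]

If $x \in \mathbb{C}^n$, then its DFT is the vector $F_n x$. The DFT has an
extremely important role to play throughout applied mathematics and
engineering.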
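If n is highly composite, then it is possible to carry out the DFT in
many fewer than the $O(n^2)$ flops required by conventional matrix-vector
multiplication. To illustrate this we set $n = 2^t$ and proceed to develop
the radix-2 fast Fourier transform (FFT). The starting point is to look
at an even-order DFT matrix when we permute its columns so that the
even-indexed columns come first. Consider the case n = 8. Noting that
$\omega_8^{kj} = \omega_8^{kj \bmod 8}$ we have

\[
F_8 =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & \omega & \omega^2 & \omega^3 & \omega^4 & \omega^5 & \omega^6 & \omega^7 \\
1 & \omega^2 & \omega^4 & \omega^6 & 1 & \omega^2 & \omega^4 & \omega^6 \\
1 & \omega^3 & \omega^6 & \omega & \omega^4 & \omega^7 & \omega^2 & \omega^5 \\
1 & \omega^4 & 1 & \omega^4 & 1 & \omega^4 & 1 & \omega^4 \\
1 & \omega^5 & \omega^2 & \omega^7 & \omega^4 & \omega & \omega^6 & \omega^3 \\
1 & \omega^6 & \omega^4 & \omega^2 & 1 & \omega^6 & \omega^4 & \omega^2 \\
1 & \omega^7 & \omega^6 & \omega^5 & \omega^4 & \omega^3 & \omega^2 & \omega
\end{bmatrix},
\qquad \omega = \omega_8.
\]

If we define the index vector c = [0 2 4 6 1 3 5 7], then, using
$\omega^4 = -1$ in the lower right block,

\[
F_8(:,c) =
\left[
\begin{array}{cccc|cccc}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & \omega^2 & \omega^4 & \omega^6 & \omega & \omega^3 & \omega^5 & \omega^7 \\
1 & \omega^4 & 1 & \omega^4 & \omega^2 & \omega^6 & \omega^2 & \omega^6 \\
1 & \omega^6 & \omega^4 & \omega^2 & \omega^3 & \omega & \omega^7 & \omega^5 \\
\hline
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & \omega^2 & \omega^4 & \omega^6 & -\omega & -\omega^3 & -\omega^5 & -\omega^7 \\
1 & \omega^4 & 1 & \omega^4 & -\omega^2 & -\omega^6 & -\omega^2 & -\omega^6 \\
1 & \omega^6 & \omega^4 & \omega^2 & -\omega^3 & -\omega & -\omega^7 & -\omega^5
\end{array}
\right].
\]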
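The lines through the matrix are there to help us think of $F_8(:,c)$ as a
2-by-2 matrix with 4-by-4 blocks. Noting that $\omega^2 = \omega_8^2 = \omega_4$
we see that

\[
F_8(:,c) =
\begin{bmatrix}
F_4 & \Omega_4 F_4 \\
F_4 & -\Omega_4 F_4
\end{bmatrix}
\]

where

\[
\Omega_4 =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & \omega_8 & 0 & 0 \\
0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & \omega_8^3
\end{bmatrix}.
\]

It follows that if x is an 8-vector, then

\[
F_8 x = F_8(:,c)\, x(c)
= \begin{bmatrix} F_4 & \Omega_4 F_4 \\ F_4 & -\Omega_4 F_4 \end{bmatrix}
  \begin{bmatrix} x(0:2:7) \\ x(1:2:7) \end{bmatrix}
= \begin{bmatrix} I & \Omega_4 \\ I & -\Omega_4 \end{bmatrix}
  \begin{bmatrix} F_4\, x(0:2:7) \\ F_4\, x(1:2:7) \end{bmatrix}.
\]

Thus, by simple scalings we can obtain the 8-point DFT $y = F_8 x$ from the
4-point DFTs $y_T = F_4 x(0:2:7)$ and $y_B = F_4 x(1:2:7)$:

    y(0:3) = y_T + d.*y_B
    y(4:7) = y_T - d.*y_B.

Here,

\[ d = \begin{bmatrix} 1 \\ \omega \\ \omega^2 \\ \omega^3 \end{bmatrix} \]

and ".*" indicates componentwise vector multiplication. In general, if
n = 2m, then $y = F_n x$ is given by

    y(0:m-1) = y_T + d.*y_B
    y(m:n-1) = y_T - d.*y_B

where

    $d = [1, \omega_n, \ldots, \omega_n^{m-1}]^T$,
    $y_T = F_m x(0:2:n-1)$,
    $y_B = F_m x(1:2:n-1)$.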

For $n = 2^t$ we can recur on this process until n = 1, for which
$F_1 x = x$:


    function y = FFT(x, n)
        if n = 1
            y = x
        else
            m = n/2; $\omega = e^{-2\pi i/n}$
            y_T = FFT(x(0:2:n-1), m); y_B = FFT(x(1:2:n-1), m)
            $d = [1, \omega, \ldots, \omega^{m-1}]^T$; z = d.*y_B
            y = [y_T + z ; y_T - z]
        end
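
The recursion above translates almost line for line into Python. The sketch below is illustrative only: the names `fft` and `dft` are choices of this sketch, the input length is assumed to be a power of two, and the direct $O(n^2)$ routine `dft` is included purely as a reference for checking the recursive version.

```python
import cmath


def fft(x):
    """Recursive radix-2 DFT of x; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    m = n // 2
    y_top = fft(x[0::2])                   # DFT of the even-indexed entries
    y_bot = fft(x[1::2])                   # DFT of the odd-indexed entries
    w = cmath.exp(-2j * cmath.pi / n)      # omega_n
    z = [w ** k * y_bot[k] for k in range(m)]   # z = d .* y_B
    return ([y_top[k] + z[k] for k in range(m)]
            + [y_top[k] - z[k] for k in range(m)])


def dft(x):
    """Direct O(n^2) evaluation of F_n x, used here only as a reference."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
max_diff = max(abs(a - b) for a, b in zip(fft(x), dft(x)))
```

The quantity `max_diff` should be at the level of roundoff, since both routines evaluate the same matrix-vector product in floating point arithmetic.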


This is a member of the fast Fourier transform family of algorithms. It
has a nonrecursive implementation that is best presented in terms of a
factorization of $F_n$. Indeed, it can be shown that
$F_n = A_t \cdots A_1 P_n$ where

\[ A_q = I_r \otimes B_L, \qquad L = 2^q, \quad r = n/L \]

with

\[
B_L = \begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix}
\quad \text{and} \quad
\Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}).
\]

The matrix $P_n$ is called the bit reversal permutation, the description of
which we omit. (Recall the definition of the Kronecker product "$\otimes$"
from §4.5.5.) Note that with this factorization, $y = F_n x$ can be computed
as follows:

    x = P_n x
    for q = 1:t
        L = 2^q, r = n/L                                  (4.6.5)
        x = (I_r ⊗ B_L) x
    end

The matrices $A_q = (I_r \otimes B_L)$ have 2 nonzeros per row and it is this
sparsity that makes it possible to implement the DFT in $O(n \log n)$ flops.
In fact, a careful implementation involves $5n \log_2 n$ flops.

The DFT matrix has the property that

\[ F_n^{-1} = \frac{1}{n} F_n^H = \frac{1}{n} \bar{F}_n. \tag{4.6.6} \]

That is, the inverse of $F_n$ is obtained by conjugating its entries and
scaling by 1/n. A fast inverse DFT can be obtained from a (forward) FFT
merely by replacing all root-of-unity references with their complex
conjugate and dividing by n at the end.
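
The conjugate-and-scale rule can be checked numerically. The following sketch uses a direct $O(n^2)$ evaluation rather than an FFT so that it stays short; the function names and the `conjugate_roots` flag are assumptions made for this illustration.

```python
import cmath


def dft(x, conjugate_roots=False):
    """Direct evaluation of F_n x, or of conj(F_n) x if conjugate_roots."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)      # omega_n
    if conjugate_roots:
        w = w.conjugate()                  # replace omega_n by its conjugate
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]


def inverse_dft(y):
    """Apply F_n^{-1} = conj(F_n)/n as in (4.6.6)."""
    n = len(y)
    return [v / n for v in dft(y, conjugate_roots=True)]


x = [1, 2 - 1j, 0, 3j, -4, 0.5]
roundtrip = inverse_dft(dft(x))
err = max(abs(a - b) for a, b in zip(roundtrip, x))
```

Here `err` measures how far the round trip departs from the original vector; it should be of the order of the unit roundoff.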

The value of the DFT is that many "hard problems" are made simple
by transforming into Fourier space (via $F_n$). The sought-after solution
is then obtained by transforming the Fourier space solution into original
coordinates (via $F_n^{-1}$).


Problems

P4.6.1 Show that if $V = V(x_0,\ldots,x_n)$, then

\[ \det(V) = \prod_{n \ge i > j \ge 0} (x_i - x_j). \]

P4.6.2 (Gautschi 1975a) Verify the following inequality for the n = 1 case above:

\[ \| V^{-1} \|_\infty \le \max_{0 \le k \le n} \prod_{i=0,\, i \ne k}^{n} \frac{1 + |x_i|}{|x_k - x_i|}. \]

Equality results if the $x_i$ are all on the same ray in the complex plane.

P4.6.3 Suppose $w = [1, \omega_n, \omega_n^2, \ldots, \omega_n^{n/2-1}]$ where $n = 2^t$.
Using colon notation, express

\[ [1, \omega_r, \omega_r^2, \ldots, \omega_r^{r/2-1}] \]

as a subvector of w where $r = 2^q$, q = 1:t.

P4.6.4 Prove (4.6.6).

P4.6.5 Expand the operation $x = (I_r \otimes B_L)x$ in (4.6.5) into a double loop and
count the number of flops required by your implementation. (Ignore the details of
$x = P_n x$.)

P4.6.6 Suppose n = 3m and examine

\[ G = [\, F_n(:, 0:3:n-1) \;\; F_n(:, 1:3:n-1) \;\; F_n(:, 2:3:n-1) \,] \]

as a 3-by-3 block matrix, looking for scaled copies of $F_m$. Based on what you find,
develop a recursive radix-3 FFT analogous to the radix-2 implementation in the text.


Notes and References for Sec. 4.6 
Our discussion of Vandermonde linear systems is drawn from the papers 


A. Björck and V. Pereyra (1970). “Solution of Vandermonde Systems of Equations,” Math.
Comp. 24, 893-903.

A. Björck and T. Elfving (1973). “Algorithms for Confluent Vandermonde Systems,” 
Numer. Math. 21, 130-37. 


The divided difference computations we discussed are detailed in chapter 2 of 


S.D. Conte and C. de Boor (1980). Elementary Numerical Analysis: An Algorithmic
Approach, 3rd ed., McGraw-Hill, New York.


The latter reference includes an Algol procedure. Error analyses of Vandermonde system
solvers include


N.J. Higham (1987b). “Error Analysis of the Björck-Pereyra Algorithms for Solving
Vandermonde Systems,” Numer. Math. 50, 613-632.

N.J. Higham (1988a). “Fast Solution of Vandermonde-like Systems Involving Orthogonal
Polynomials,” IMA J. Num. Anal. 8, 473-486.

N.J. Higham (1990). “Stability Analysis of Algorithms for Solving Confluent Vandermonde-
like Systems,” SIAM J. Matrix Anal. Appl. 11, 23-41.

S.G. Bartels and D.J. Higham (1992). “The Structured Sensitivity of Vandermonde-Like
Systems,” Numer. Math. 62, 17-33.

J.M. Varah (1993). “Errors and Perturbations in Vandermonde Systems,” IMA J. Num.
Anal. 13, 1-12.


Interesting theoretical results concerning the condition of Vandermonde systems may be 
found in 


W. Gautschi (1975a). “Norm Estimates for Inverses of Vandermonde Matrices,” Numer.
Math. 23, 337-47.

W. Gautschi (1975b). “Optimally Conditioned Vandermonde Matrices,” Numer. Math.
24, 1-12.


The basic algorithms presented can be extended to cover confluent Vandermonde
systems, block Vandermonde systems, and Vandermonde systems that are based on other
polynomial bases:


G. Galimberti and V. Pereyra (1970). “Numerical Differentiation and the Solution of
Multidimensional Vandermonde Systems,” Math. Comp. 24, 357-64.

G. Galimberti and V. Pereyra (1971). “Solving Confluent Vandermonde Systems of
Hermitian Type,” Numer. Math. 18, 44-60.

H. Van de Vel (1977). “Numerical Treatment of a Generalized Vandermonde System of
Equations,” Lin. Alg. and Its Applic. 17, 149-74.

G.H. Golub and W.P. Tang (1981). “The Block Decomposition of a Vandermonde Matrix
and Its Applications,” BIT 21, 505-17.

D. Calvetti and L. Reichel (1992). “A Chebychev-Vandermonde Solver,” Lin. Alg. and
Its Applic. 172, 219-229.

D. Calvetti and L. Reichel (1993). “Fast Inversion of Vandermonde-Like Matrices
Involving Orthogonal Polynomials,” BIT 33, 473-484.

H. Lu (1994). “Fast Solution of Confluent Vandermonde Linear Systems,” SIAM J.
Matrix Anal. Appl. 15, 1277-1289.

H. Lu (1996). “Solution of Vandermonde-like Systems and Confluent Vandermonde-like
Systems,” SIAM J. Matrix Anal. Appl. 17, 127-138.


The FFT literature is very extensive and scattered. For an overview of the area couched
in Kronecker product notation, see


C.F. Van Loan (1992). Computational Frameworks for the Fast Fourier Transform,
SIAM Publications, Philadelphia, PA.


The point of view in this text is that different FFTs correspond to different factorizations
of the DFT matrix. These are sparse factorizations in that the factors have very few
nonzeros per row.


4.7 Toeplitz and Related Systems


Matrices whose entries are constant along each diagonal arise in many
applications and are called Toeplitz matrices. Formally,
$T \in \mathbb{R}^{n \times n}$ is Toeplitz if there exist scalars
$r_{-n+1}, \ldots, r_0, \ldots, r_{n-1}$ such that $t_{ij} = r_{j-i}$ for all i
and j. Thus,

\[
T =
\begin{bmatrix}
r_0 & r_1 & r_2 & r_3 \\
r_{-1} & r_0 & r_1 & r_2 \\
r_{-2} & r_{-1} & r_0 & r_1 \\
r_{-3} & r_{-2} & r_{-1} & r_0
\end{bmatrix}
=
\begin{bmatrix}
3 & 1 & 7 & 6 \\
4 & 3 & 1 & 7 \\
0 & 4 & 3 & 1 \\
9 & 0 & 4 & 3
\end{bmatrix}
\]

is Toeplitz.

Toeplitz matrices belong to the larger class of persymmetric matrices.
We say that $B \in \mathbb{R}^{n \times n}$ is persymmetric if it is symmetric
about its northeast-southwest diagonal, i.e., $b_{ij} = b_{n-j+1,n-i+1}$ for
all i and j. This is equivalent to requiring $B = E B^T E$ where
$E = [e_n, \ldots, e_1] = I_n(:, n:-1:1)$ is the n-by-n exchange matrix, e.g.,

\[
E =
\begin{bmatrix}
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}.
\]

It is easy to verify that (a) Toeplitz matrices are persymmetric and (b) the
inverse of a nonsingular Toeplitz matrix is persymmetric. In this section we
show how the careful exploitation of (b) enables us to solve Toeplitz systems
with $O(n^2)$ flops. The discussion focuses on the important case when T is
also symmetric and positive definite. Unsymmetric Toeplitz systems and
connections with circulant matrices and the discrete Fourier transform are
briefly discussed.


4.7.1 Three Problems 


Assume that we have scalars $r_1, \ldots, r_n$ such that for k = 1:n the matrices

\[
T_k =
\begin{bmatrix}
1 & r_1 & \cdots & r_{k-2} & r_{k-1} \\
r_1 & 1 & \ddots & & r_{k-2} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
r_{k-2} & & \ddots & 1 & r_1 \\
r_{k-1} & r_{k-2} & \cdots & r_1 & 1
\end{bmatrix}
\]

are positive definite. (There is no loss of generality in normalizing the
diagonal.) Three algorithms are described in this section:

• Durbin's algorithm for the Yule-Walker problem $T_n y = -[r_1, \ldots, r_n]^T$.
• Levinson's algorithm for the general right-hand side problem $T_n x = b$.
• Trench's algorithm for computing $B = T_n^{-1}$.

In deriving these methods, we denote the k-by-k exchange matrix by $E_k$,
i.e., $E_k = I_k(:, k:-1:1)$.


4.7.2 Solving the Yule-Walker Equations 


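We begin by presenting Durbin's algorithm for the Yule-Walker equations
which arise in conjunction with certain linear prediction problems. Suppose
for some k that satisfies $1 \le k \le n-1$ we have solved the k-th order
Yule-Walker system $T_k y = -r = -(r_1, \ldots, r_k)^T$. We now show how the
(k+1)-st order Yule-Walker system

\[
\begin{bmatrix} T_k & E_k r \\ r^T E_k & 1 \end{bmatrix}
\begin{bmatrix} z \\ \alpha \end{bmatrix}
= - \begin{bmatrix} r \\ r_{k+1} \end{bmatrix}
\]

can be solved in O(k) flops. First observe that

\[ z = T_k^{-1}(-r - \alpha E_k r) = y - \alpha T_k^{-1} E_k r \]

and

\[ \alpha = -r_{k+1} - r^T E_k z. \]

Since $T_k^{-1}$ is persymmetric, $T_k^{-1} E_k = E_k T_k^{-1}$ and thus,

\[ z = y - \alpha E_k T_k^{-1} r = y + \alpha E_k y. \]

By substituting this into the above expression for $\alpha$ we find

\[ \alpha = -r_{k+1} - r^T E_k (y + \alpha E_k y) = -(r_{k+1} + r^T E_k y)/(1 + r^T y). \]

The denominator is positive because $T_{k+1}$ is positive definite and because

\[
\begin{bmatrix} I & E_k y \\ 0 & 1 \end{bmatrix}^T
\begin{bmatrix} T_k & E_k r \\ r^T E_k & 1 \end{bmatrix}
\begin{bmatrix} I & E_k y \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} T_k & 0 \\ 0 & 1 + r^T y \end{bmatrix}.
\]

We have illustrated the kth step of an algorithm proposed by Durbin (1960).
It proceeds by solving the Yule-Walker systems

\[ T_k y^{(k)} = -r^{(k)} = -[r_1, \ldots, r_k]^T \]

for k = 1:n as follows:

    $y^{(1)} = -r_1$
    for k = 1:n-1
        $\beta_k = 1 + [r^{(k)}]^T y^{(k)}$
        $\alpha_k = -(r_{k+1} + [r^{(k)}]^T E_k y^{(k)})/\beta_k$          (4.7.1)
        $z^{(k)} = y^{(k)} + \alpha_k E_k y^{(k)}$
        $y^{(k+1)} = \begin{bmatrix} z^{(k)} \\ \alpha_k \end{bmatrix}$
    end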

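As it stands, this algorithm would require $3n^2$ flops to generate
$y = y^{(n)}$. It is possible, however, to reduce the amount of work even
further by exploiting some of the above expressions:

\[
\begin{aligned}
\beta_k &= 1 + [r^{(k)}]^T y^{(k)} \\
&= 1 + \begin{bmatrix} [r^{(k-1)}]^T & r_k \end{bmatrix}
   \begin{bmatrix} y^{(k-1)} + \alpha_{k-1} E_{k-1} y^{(k-1)} \\ \alpha_{k-1} \end{bmatrix} \\
&= \left( 1 + [r^{(k-1)}]^T y^{(k-1)} \right)
   + \alpha_{k-1} \left( [r^{(k-1)}]^T E_{k-1} y^{(k-1)} + r_k \right) \\
&= \beta_{k-1} + \alpha_{k-1} ( -\beta_{k-1} \alpha_{k-1} ) \\
&= (1 - \alpha_{k-1}^2)\, \beta_{k-1}.
\end{aligned}
\]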

Using this recursion we obtain the following algorithm:


Algorithm 4.7.1 (Durbin) Given real numbers $1 = r_0, r_1, \ldots, r_n$ such
that $T = (r_{|i-j|}) \in \mathbb{R}^{n \times n}$ is positive definite, the
following algorithm computes $y \in \mathbb{R}^n$ such that
$Ty = -(r_1, \ldots, r_n)^T$.

    y(1) = -r(1); β = 1; α = -r(1)
    for k = 1:n-1
        β = (1 - α^2)β
        α = -(r(k+1) + r(k:-1:1)^T y(1:k))/β
        z(1:k) = y(1:k) + αy(k:-1:1)
        y(1:k+1) = [z(1:k) ; α]
    end


This algorithm requires $2n^2$ flops. We have included an auxiliary vector z
for clarity, but it can be avoided.
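
As a rough illustration, Algorithm 4.7.1 can be written in Python as below. The function name `durbin` and the zero-based storage of r (so that `r[0]` holds $r_1$) are conventions of this sketch, not of the text; `fractions.Fraction` is used so that the data of Example 4.7.1 can be reproduced exactly.

```python
from fractions import Fraction


def durbin(r):
    """Solve T y = -r where T = (r_{|i-j|}) has unit diagonal.

    r = [r_1, ..., r_n]; T is assumed symmetric positive definite.
    """
    n = len(r)
    y = [-r[0]]
    beta = 1
    alpha = -r[0]
    for k in range(1, n):
        beta = (1 - alpha * alpha) * beta
        # r(k+1) + r(k:-1:1)^T y(1:k) in the notation of the text
        s = r[k] + sum(r[k - 1 - i] * y[i] for i in range(k))
        alpha = -s / beta
        # z(1:k) = y(1:k) + alpha * y(k:-1:1), then append alpha
        y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return y


# Data of Example 4.7.1: r = (.5, .2, .1).
r = [Fraction(1, 2), Fraction(1, 5), Fraction(1, 10)]
y = durbin(r)   # equals [-75/140, 12/140, -5/140]
```

Building a new list for y at each step costs an extra allocation per iteration; an in-place update, as suggested by the remark about avoiding z, would remove it.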


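Example 4.7.1 Suppose we wish to solve the Yule-Walker system

\[
\begin{bmatrix} 1 & .5 & .2 \\ .5 & 1 & .5 \\ .2 & .5 & 1 \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
= - \begin{bmatrix} .5 \\ .2 \\ .1 \end{bmatrix}
\]

using Algorithm 4.7.1. After one pass through the loop we obtain

\[ \alpha = 1/15, \qquad \beta = 3/4, \qquad y = \begin{bmatrix} -8/15 \\ 1/15 \end{bmatrix}. \]

We then compute

\[
\begin{aligned}
\beta &= (1 - \alpha^2)\beta = 56/75 \\
\alpha &= -(r_3 + r_2 y_1 + r_1 y_2)/\beta = -1/28 \\
z_1 &= y_1 + \alpha y_2 = -225/420 \\
z_2 &= y_2 + \alpha y_1 = 36/420,
\end{aligned}
\]

giving the final solution $y = [-75, 12, -5]^T/140$.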


4.7.3 The General Right Hand Side Problem 


With a little extra work, it is possible to solve a symmetric positive definite
Toeplitz system that has an arbitrary right-hand side. Suppose that we
have solved the system

\[ T_k x = b = (b_1, \ldots, b_k)^T \tag{4.7.2} \]

for some k satisfying $1 \le k < n$ and that we now wish to solve

\[
\begin{bmatrix} T_k & E_k r \\ r^T E_k & 1 \end{bmatrix}
\begin{bmatrix} v \\ \mu \end{bmatrix}
= \begin{bmatrix} b \\ b_{k+1} \end{bmatrix}.
\tag{4.7.3}
\]

Here, $r = (r_1, \ldots, r_k)^T$ as above. Assume also that the solution to the
kth order Yule-Walker system $T_k y = -r$ is available. From
$T_k v + \mu E_k r = b$ it follows that

\[ v = T_k^{-1}(b - \mu E_k r) = x - \mu T_k^{-1} E_k r = x + \mu E_k y \]

and so

\[
\begin{aligned}
\mu &= b_{k+1} - r^T E_k v \\
&= b_{k+1} - r^T E_k x - \mu r^T y \\
&= (b_{k+1} - r^T E_k x)/(1 + r^T y).
\end{aligned}
\]

Consequently, we can effect the transition from (4.7.2) to (4.7.3) in O(k)
flops.

Overall, we can efficiently solve the system $T_n x = b$ by solving the
systems $T_k x^{(k)} = b^{(k)} = (b_1, \ldots, b_k)^T$ and
$T_k y^{(k)} = -r^{(k)} = -(r_1, \ldots, r_k)^T$ "in parallel" for k = 1:n.
This is the gist of the following algorithm:


Algorithm 4.7.2 (Levinson) Given $b \in \mathbb{R}^n$ and real numbers
$1 = r_0, r_1, \ldots, r_n$ such that
$T = (r_{|i-j|}) \in \mathbb{R}^{n \times n}$ is positive definite, the
following algorithm computes $x \in \mathbb{R}^n$ such that $Tx = b$.

    y(1) = -r(1); x(1) = b(1); β = 1; α = -r(1)
    for k = 1:n-1
        β = (1 - α^2)β; μ = (b(k+1) - r(1:k)^T x(k:-1:1))/β
        v(1:k) = x(1:k) + μy(k:-1:1)
        x(1:k+1) = [v(1:k) ; μ]
        if k < n-1
            α = -(r(k+1) + r(1:k)^T y(k:-1:1))/β
            z(1:k) = y(1:k) + αy(k:-1:1)
            y(1:k+1) = [z(1:k) ; α]
        end
    end


This algorithm requires $4n^2$ flops. The vectors z and v are for clarity and
can be avoided in a detailed implementation.
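
A Python sketch of Algorithm 4.7.2 follows. The name `levinson`, the zero-based storage of r (with `r[0]` holding $r_1$), and the early return for n = 1 are assumptions of this sketch; exact rational arithmetic is used so that the result can be checked against Example 4.7.2.

```python
from fractions import Fraction


def levinson(r, b):
    """Solve T x = b where T = (r_{|i-j|}) has unit diagonal.

    r = [r_1, ..., r_{n-1}] (further entries are ignored), b = [b_1, ..., b_n];
    T is assumed symmetric positive definite.
    """
    n = len(b)
    if n == 1:
        return [b[0]]
    y = [-r[0]]
    x = [b[0]]
    beta = 1
    alpha = -r[0]
    for k in range(1, n):
        beta = (1 - alpha * alpha) * beta
        # mu = (b(k+1) - r(1:k)^T x(k:-1:1)) / beta
        mu = (b[k] - sum(r[i] * x[k - 1 - i] for i in range(k))) / beta
        x = [x[i] + mu * y[k - 1 - i] for i in range(k)] + [mu]
        if k < n - 1:
            # Durbin step: extend the Yule-Walker solution y
            alpha = -(r[k] + sum(r[i] * y[k - 1 - i] for i in range(k))) / beta
            y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return x


# Data of Example 4.7.2.
r = [Fraction(1, 2), Fraction(1, 5)]
b = [Fraction(4), Fraction(-1), Fraction(3)]
x = levinson(r, b)   # equals [355/56, -376/56, 285/56]
```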


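Example 4.7.2 Suppose we wish to solve the symmetric positive definite Toeplitz
system

\[
\begin{bmatrix} 1 & .5 & .2 \\ .5 & 1 & .5 \\ .2 & .5 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix}
\]

using the above algorithm. After one pass through the loop we obtain

\[
\alpha = 1/15, \qquad \beta = 3/4, \qquad
y = \begin{bmatrix} -8/15 \\ 1/15 \end{bmatrix}, \qquad
x = \begin{bmatrix} 6 \\ -4 \end{bmatrix}.
\]

We then compute

\[
\begin{aligned}
\beta &= (1 - \alpha^2)\beta = 56/75, & \mu &= (b_3 - r_1 x_2 - r_2 x_1)/\beta = 285/56, \\
v_1 &= x_1 + \mu y_2 = 355/56, & v_2 &= x_2 + \mu y_1 = -376/56,
\end{aligned}
\]

giving the final solution $x = [355, -376, 285]^T/56$.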


4.7.4 Computing the Inverse 


One of the most surprising properties of a symmetric positive definite
Toeplitz matrix $T_n$ is that its complete inverse can be calculated in
$O(n^2)$ flops. To derive the algorithm for doing this, partition $T_n^{-1}$
as follows

\[
T_n^{-1} = \begin{bmatrix} A & Er \\ r^T E & 1 \end{bmatrix}^{-1}
= \begin{bmatrix} B & v \\ v^T & \gamma \end{bmatrix}
\tag{4.7.4}
\]

where $A = T_{n-1}$, $E = E_{n-1}$, and $r = (r_1, \ldots, r_{n-1})^T$. From the
equation

\[
\begin{bmatrix} A & Er \\ r^T E & 1 \end{bmatrix}
\begin{bmatrix} v \\ \gamma \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\]

it follows that $Av = -\gamma Er = -\gamma E(r_1, \ldots, r_{n-1})^T$ and
$\gamma = 1 - r^T E v$. If y solves the (n-1)-st order Yule-Walker system
$Ay = -r$, then these expressions imply that

\[ \gamma = 1/(1 + r^T y), \qquad v = \gamma E y. \]

Thus, the last row and column of $T_n^{-1}$ are readily obtained.
It remains for us to develop working formulae for the entries of the
submatrix B in (4.7.4). Since $AB + Erv^T = I_{n-1}$, it follows that

\[ B = A^{-1} - (A^{-1} E r) v^T = A^{-1} + \frac{v v^T}{\gamma}. \]

Now since $A = T_{n-1}$ is nonsingular and Toeplitz, its inverse is
persymmetric. Thus,

\[
\begin{aligned}
b_{ij} &= (A^{-1})_{ij} + \frac{v_i v_j}{\gamma} \\
&= (A^{-1})_{n-j,n-i} + \frac{v_i v_j}{\gamma} \\
&= b_{n-j,n-i} - \frac{v_{n-j} v_{n-i}}{\gamma} + \frac{v_i v_j}{\gamma} \\
&= b_{n-j,n-i} + \frac{1}{\gamma} \left( v_i v_j - v_{n-j} v_{n-i} \right).
\end{aligned}
\tag{4.7.5}
\]

This indicates that although B is not persymmetric, we can readily compute
an element $b_{ij}$ from its reflection across the northeast-southwest axis.
Coupling this with the fact that $A^{-1}$ is persymmetric enables us to
determine B from its "edges" to its "interior."

Because the order of operations is rather cumbersome to describe, we
preview the formal specification of the algorithm pictorially. To this end,
assume that we know the last column and row of $T_n^{-1}$:

                u u u u u k
                u u u u u k
    T_n^{-1} =  u u u u u k
                u u u u u k
                u u u u u k
                k k k k k k

Here u and k denote the unknown and the known entries respectively, and
n = 6. Alternately exploiting the persymmetry of $T_n^{-1}$ and the recursion
(4.7.5), we can compute B, the leading (n-1)-by-(n-1) block of $T_n^{-1}$, as
follows:

              k k k k k k              k k k k k k
              k u u u u k              k u u u k k
    persym.   k u u u u k   (4.7.5)    k u u u k k
    ------>   k u u u u k   ------>    k u u u k k
              k u u u u k              k k k k k k
              k k k k k k              k k k k k k

              k k k k k k              k k k k k k
              k k k k k k              k k k k k k
    persym.   k k u u k k   (4.7.5)    k k u k k k
    ------>   k k u u k k   ------>    k k k k k k
              k k k k k k              k k k k k k
              k k k k k k              k k k k k k

              k k k k k k
              k k k k k k
    persym.   k k k k k k
    ------>   k k k k k k
              k k k k k k
              k k k k k k

Of course, when computing a matrix that is both symmetric and persymmetric,
such as $T_n^{-1}$, it is only necessary to compute the "upper wedge" of
the matrix, e.g.,

    x x x x x x
      x x x x          (n = 6)
        x x

With this last observation, we are ready to present the overall algorithm.


Algorithm 4.7.3 (Trench) Given real numbers $1 = r_0, r_1, \ldots, r_n$ such
that $T = (r_{|i-j|}) \in \mathbb{R}^{n \times n}$ is positive definite, the
following algorithm computes $B = T_n^{-1}$. Only those $b_{ij}$ for which
$i \le j$ and $i + j \le n + 1$ are computed.

    Use Algorithm 4.7.1 to solve $T_{n-1} y = -(r_1, \ldots, r_{n-1})^T$.
    γ = 1/(1 + r(1:n-1)^T y(1:n-1))
    v(1:n-1) = γy(n-1:-1:1)
    B(1,1) = γ
    B(1,2:n) = v(n-1:-1:1)^T
    for i = 2:floor((n-1)/2)+1
        for j = i:n-i+1
            B(i,j) = B(i-1,j-1) + (v(n+1-j)v(n+1-i) - v(i-1)v(j-1))/γ
        end
    end


This algorithm requires $13n^2/4$ flops.
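
The sketch below mirrors Algorithm 4.7.3 in Python and then fills in the remaining entries using symmetry and persymmetry. The names `trench_inverse` and `durbin`, the zero-based storage of r, and the final mirroring loop are assumptions of this illustration (the algorithm in the text stops after the upper wedge). It assumes n ≥ 2 and uses exact rational arithmetic so that Example 4.7.3 can be reproduced.

```python
from fractions import Fraction


def durbin(r):
    """Algorithm 4.7.1: solve T y = -r for r = [r_1, ..., r_n]."""
    y, beta, alpha = [-r[0]], 1, -r[0]
    for k in range(1, len(r)):
        beta = (1 - alpha * alpha) * beta
        alpha = -(r[k] + sum(r[k - 1 - i] * y[i] for i in range(k))) / beta
        y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return y


def trench_inverse(r):
    """Inverse of T = (r_{|i-j|}), unit diagonal, r = [r_1, ..., r_{n-1}]."""
    n = len(r) + 1
    y = durbin(r)                                    # T_{n-1} y = -r
    gamma = 1 / (1 + sum(ri * yi for ri, yi in zip(r, y)))
    v = [gamma * yi for yi in reversed(y)]           # v[k-1] holds v(k)
    B = [[None] * n for _ in range(n)]
    B[0][0] = gamma
    for j in range(2, n + 1):                        # B(1,2:n) = v(n-1:-1:1)^T
        B[0][j - 1] = v[n - j]
    for i in range(2, (n - 1) // 2 + 2):             # 1-based i and j
        for j in range(i, n - i + 2):
            B[i - 1][j - 1] = B[i - 2][j - 2] + (
                v[n - j] * v[n - i] - v[i - 2] * v[j - 2]) / gamma
    # Fill the rest from the upper wedge by symmetry and persymmetry.
    for i in range(n):
        for j in range(i, n - i):
            val = B[i][j]
            B[j][i] = val
            B[n - 1 - j][n - 1 - i] = val
            B[n - 1 - i][n - 1 - j] = val
    return B


# Data of Example 4.7.3.
B = trench_inverse([Fraction(1, 2), Fraction(1, 5)])
```

For this input, `B[0]` holds 75/56, -5/7, 5/56 and `B[1][1]` is 12/7, in agreement with Example 4.7.3.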


Example 4.7.3 If the above algorithm is applied to compute the inverse B of the
positive definite Toeplitz matrix

\[ \begin{bmatrix} 1 & .5 & .2 \\ .5 & 1 & .5 \\ .2 & .5 & 1 \end{bmatrix}, \]

then we obtain $\gamma = 75/56$, $b_{11} = 75/56$, $b_{12} = -5/7$,
$b_{13} = 5/56$, and $b_{22} = 12/7$.


4.7.5 Stability Issues 


Error analyses for the above algorithms have been performed by Cybenko 
(1978), and we briefly report on some of his findings. 

The key quantities turn out to be the $\alpha_k$ in (4.7.1). In exact
arithmetic these scalars satisfy

\[ |\alpha_k| < 1 \]

and can be used to bound $\| T_n^{-1} \|_1$:

\[
\max \left\{ \frac{1}{\prod_{j=1}^{n-1} (1 - \alpha_j^2)},\
             \frac{1}{\prod_{j=1}^{n-1} (1 - \alpha_j)} \right\}
\le \| T_n^{-1} \|_1 \le
\prod_{j=1}^{n-1} \frac{1 + |\alpha_j|}{1 - |\alpha_j|}.
\tag{4.7.6}
\]

Moreover, the solution to the Yule-Walker system $T_n y = -r(1:n)$ satisfies

\[ \| y \|_1 = \left( \prod_{k=1}^{n} (1 + \alpha_k) \right) - 1 \tag{4.7.7} \]

provided all the $\alpha_k$ are non-negative.

Now if $\hat{y}$ is the computed Durbin solution to the Yule-Walker equations
then $r_D = T_n \hat{y} + r$ can be bounded as follows

\[ \| r_D \| \approx u \prod_{k=1}^{n} (1 + |\hat{\alpha}_k|) \]

where $\hat{\alpha}_k$ is the computed version of $\alpha_k$. By way of
comparison, since each $|r_i|$ is bounded by unity, it follows that
$\| r_C \| \approx u \| y \|_1$, where $r_C$ is the residual associated with
the computed solution obtained via Cholesky. Note that the two residuals are
of comparable magnitude provided (4.7.7) holds. Experimental evidence
suggests that this is the case even if some of the $\alpha_k$ are negative.
Similar comments apply to the numerical behavior of the Levinson algorithm.

For the Trench method, the computed inverse $\hat{B}$ of $T_n^{-1}$ can be
shown to satisfy

\[
\frac{\| T_n^{-1} - \hat{B} \|_1}{\| T_n^{-1} \|_1}
\approx u \prod_{k=1}^{n-1} \frac{1 + |\hat{\alpha}_k|}{1 - |\hat{\alpha}_k|}.
\]

In light of (4.7.6) we see that the right-hand side is an approximate upper
bound for $u \| T_n^{-1} \|$, which is approximately the size of the relative
error when $T_n^{-1}$ is calculated using the Cholesky factorization.


4.7.6 The Unsymmetric Case


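Similar recursions can be developed for the unsymmetric case. Suppose we
are given scalars $r_1, \ldots, r_{n-1}$, $p_1, \ldots, p_{n-1}$, and
$b_1, \ldots, b_n$ and that we want to solve a linear system $Tx = b$ of the
form

\[
\begin{bmatrix}
1 & r_1 & r_2 & r_3 & r_4 \\
p_1 & 1 & r_1 & r_2 & r_3 \\
p_2 & p_1 & 1 & r_1 & r_2 \\
p_3 & p_2 & p_1 & 1 & r_1 \\
p_4 & p_3 & p_2 & p_1 & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{bmatrix}
\qquad (n = 5).
\]

In the process that follows we require the leading principal submatrices
$T_k = T(1:k, 1:k)$, k = 1:n, to be nonsingular. Using the same notation as
above, it can be shown that if we have the solutions to the k-by-k systems

\[
\begin{aligned}
T_k^T y &= -r = -[r_1\ r_2\ \cdots\ r_k]^T \\
T_k w &= -p = -[p_1\ p_2\ \cdots\ p_k]^T \\
T_k x &= b = [b_1\ b_2\ \cdots\ b_k]^T,
\end{aligned}
\tag{4.7.8}
\]

then we can obtain solutions to

\[
\begin{aligned}
\begin{bmatrix} T_k & E_k r \\ p^T E_k & 1 \end{bmatrix}^T
\begin{bmatrix} z \\ \alpha \end{bmatrix}
&= - \begin{bmatrix} r \\ r_{k+1} \end{bmatrix} \\
\begin{bmatrix} T_k & E_k r \\ p^T E_k & 1 \end{bmatrix}
\begin{bmatrix} u \\ \nu \end{bmatrix}
&= - \begin{bmatrix} p \\ p_{k+1} \end{bmatrix} \\
\begin{bmatrix} T_k & E_k r \\ p^T E_k & 1 \end{bmatrix}
\begin{bmatrix} v \\ \mu \end{bmatrix}
&= \begin{bmatrix} b \\ b_{k+1} \end{bmatrix}
\end{aligned}
\tag{4.7.9}
\]

in O(k) flops. This means that in principle it is possible to solve an
unsymmetric Toeplitz system in $O(n^2)$ flops. However, the stability of the
process cannot be assured unless the matrices $T_k = T(1:k, 1:k)$ are
sufficiently well conditioned.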


4.7.7  Circulant Systems 


A very important class of Toeplitz matrices are the circulant matrices. Here
is an example:

\[
C(v) =
\begin{bmatrix}
v_0 & v_4 & v_3 & v_2 & v_1 \\
v_1 & v_0 & v_4 & v_3 & v_2 \\
v_2 & v_1 & v_0 & v_4 & v_3 \\
v_3 & v_2 & v_1 & v_0 & v_4 \\
v_4 & v_3 & v_2 & v_1 & v_0
\end{bmatrix}.
\]

Notice that each column of a circulant is a "downshifted" version of its
predecessor. In particular, if we define the downshift permutation $S_n$ by

\[
S_n =
\begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0
\end{bmatrix}
\qquad (n = 5)
\]

and $v = [v_0\ v_1\ \cdots\ v_{n-1}]^T$, then
$C(v) = [v, S_n v, S_n^2 v, \ldots, S_n^{n-1} v]$.

There are important connections between circulant matrices, Toeplitz
matrices, and the DFT. First of all, it can be shown that

\[ C(v) = F_n^{-1}\, \mathrm{diag}(F_n v)\, F_n. \tag{4.7.10} \]

This means that a product of the form $y = C(v)x$ can be computed at "FFT
speed":

    $\tilde{x} = F_n x$
    $\tilde{v} = F_n v$
    $z = \tilde{v} .\!* \tilde{x}$
    $y = F_n^{-1} z$

In other words, three DFTs and a vector multiply suffice to carry out the
product of a circulant matrix and a vector. Products of this form are called
convolutions and they are ubiquitous in signal processing and other areas.

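Toeplitz-vector products can also be computed fast. The key idea is
that any Toeplitz matrix can be "embedded" in a circulant. For example,

\[
T = \begin{bmatrix} 5 & 2 & 7 \\ 4 & 5 & 2 \\ 9 & 4 & 5 \end{bmatrix}
\]

is the leading 3-by-3 submatrix of the circulant

\[
C =
\begin{bmatrix}
5 & 2 & 7 & 9 & 4 \\
4 & 5 & 2 & 7 & 9 \\
9 & 4 & 5 & 2 & 7 \\
7 & 9 & 4 & 5 & 2 \\
2 & 7 & 9 & 4 & 5
\end{bmatrix}.
\]

In general, if $T = (t_{ij})$ is an n-by-n Toeplitz matrix, then
$T = C(1:n, 1:n)$ where $C \in \mathbb{R}^{(2n-1)\times(2n-1)}$ is a circulant
with

\[ C(:,1) = \begin{bmatrix} T(1:n, 1) \\ T(1, n:-1:2)^T \end{bmatrix}. \]

Note that if $y = Cx$ and $x(n+1:2n-1) = 0$, then $y(1:n) = Tx(1:n)$, showing
that Toeplitz-vector products can also be computed at "FFT speed."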
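
The embedding can be illustrated with a few lines of Python. In this sketch the function name `toeplitz_matvec_via_circulant` and its argument names are hypothetical, the Toeplitz matrix is described by its first column and first row, and the circulant product is evaluated by a direct circular convolution to keep the example short; in practice that step is where a fast transform would be used.

```python
def toeplitz_matvec_via_circulant(col, row, x):
    """Return T x where T is Toeplitz with first column col and first row row.

    The (2n-1)-vector c = [T(1:n,1); T(1,n:-1:2)^T] is the first column of
    the embedding circulant C, and x is padded with n-1 zeros.
    """
    n = len(x)
    c = list(col) + list(row[:0:-1])      # row[:0:-1] is T(1, n:-1:2)
    x_pad = list(x) + [0] * (n - 1)
    size = 2 * n - 1
    # y = C x_pad with C[i][j] = c[(i - j) mod (2n-1)]
    y = [sum(c[(i - j) % size] * x_pad[j] for j in range(size))
         for i in range(size)]
    return y[:n]                          # y(1:n) = T x


# The 3-by-3 Toeplitz matrix with first column (5, 4, 9) and first row (5, 2, 7).
col, row = [5, 4, 9], [5, 2, 7]
x = [1, 2, 3]
y = toeplitz_matvec_via_circulant(col, row, x)   # equals [30, 20, 32]
```

Because 2n-1 is odd, a radix-2 FFT does not apply to C directly; embedding T in a larger circulant of power-of-two order is one possible workaround, not discussed in the text.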


Problems


P4.7.1 For any $v \in \mathbb{R}^n$ define the vectors $v_+ = (v + E_n v)/2$ and
$v_- = (v - E_n v)/2$. Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric and
persymmetric. Show that if $Ax = b$ then $Ax_+ = b_+$ and $Ax_- = b_-$.

P4.7.2 Let $U \in \mathbb{R}^{n \times n}$ be the unit upper triangular matrix with the
property that $U(1:k-1, k) = E_{k-1} y^{(k-1)}$ where $y^{(k)}$ is defined by (4.7.1).
Show that

\[ U^T T_n U = \mathrm{diag}(1, \beta_1, \ldots, \beta_{n-1}). \]

P4.7.3 Suppose $z \in \mathbb{R}^n$ and that $S \in \mathbb{R}^{n \times n}$ is orthogonal.
Show that if

\[ X = [\, z,\ Sz,\ \ldots,\ S^{n-1} z \,] \]

then $X^T X$ is Toeplitz.

P4.7.4 Consider the $LDL^T$ factorization of an n-by-n symmetric, tridiagonal, positive
definite Toeplitz matrix. Show that $d_n$ and $l_{n,n-1}$ converge as $n \to \infty$.

P4.7.5 Show that the product of two lower triangular Toeplitz matrices is Toeplitz.

P4.7.6 Give an algorithm for determining $\mu \in \mathbb{R}$ such that

\[ T_n + \mu \left( e_n e_1^T + e_1 e_n^T \right) \]

is singular. Assume $T_n = (r_{|i-j|})$ is positive definite, with $r_0 = 1$.

P4.7.7 Rewrite Algorithm 4.7.2 so that it does not require the vectors z and v.

P4.7.8 Give an algorithm for computing $\kappa_\infty(T_k)$ for k = 1:n.

P4.7.9 Suppose $A_0$, $A_1$, $A_2$ and $A_3$ are m-by-m matrices and that

\[
A =
\begin{bmatrix}
A_0 & A_1 & A_2 & A_3 \\
A_3 & A_0 & A_1 & A_2 \\
A_2 & A_3 & A_0 & A_1 \\
A_1 & A_2 & A_3 & A_0
\end{bmatrix}.
\]

Show that there is a permutation matrix $\Pi$ such that $\Pi^T A \Pi = C = (C_{ij})$
where each $C_{ij}$ is a 4-by-4 circulant matrix.

P4.7.10 A p-by-p block matrix $A = (A_{ij})$ with m-by-m blocks is block Toeplitz if
there exist $A_{-p+1}, \ldots, A_{-1}, A_0, A_1, \ldots, A_{p-1} \in \mathbb{R}^{m \times m}$
so that $A_{ij} = A_{j-i}$, e.g.,

\[
A =
\begin{bmatrix}
A_0 & A_1 & A_2 & A_3 \\
A_{-1} & A_0 & A_1 & A_2 \\
A_{-2} & A_{-1} & A_0 & A_1 \\
A_{-3} & A_{-2} & A_{-1} & A_0
\end{bmatrix}.
\]

(a) Show that there is a permutation $\Pi$ such that

\[
\Pi^T A \Pi =
\begin{bmatrix}
T_{11} & T_{12} & \cdots & T_{1m} \\
\vdots & \vdots & & \vdots \\
T_{m1} & T_{m2} & \cdots & T_{mm}
\end{bmatrix}
\]

where each $T_{ij}$ is p-by-p and Toeplitz. Each $T_{ij}$ should be "made up" of (i,j)
entries selected from the $A_k$ matrices. (b) What can you say about the $T_{ij}$ if
$A_k = A_{-k}$, k = 1:p-1?

P4.7.11 Show how to compute the solutions to the systems in (4.7.9) given that the
solutions to the systems in (4.7.8) are available. Assume that all the matrices involved
are nonsingular. Proceed to develop a fast unsymmetric Toeplitz solver for $Tx = b$
assuming that T's leading principal submatrices are all nonsingular.

P4.7.12 A matrix $H \in \mathbb{R}^{n \times n}$ is Hankel if $H(n:-1:1, :)$ is Toeplitz.
Show that if $A \in \mathbb{R}^{n \times n}$ is defined by

\[ a_{kj} = \int_a^b \cos(k\theta) \cos(j\theta)\, d\theta \]

then A is the sum of a Hankel matrix and a Toeplitz matrix. Hint: Make use of the
identity $\cos(u + v) = \cos(u)\cos(v) - \sin(u)\sin(v)$.

P4.7.13 Verify that $F_n C(v) = \mathrm{diag}(F_n v) F_n$.

P4.7.14 Show that it is possible to embed a symmetric Toeplitz matrix into a symmetric
circulant matrix.

P4.7.15 Consider the kth order Yule-Walker system $T_k y^{(k)} = -r^{(k)}$ that arises in
(4.7.1):

\[
T_k \begin{bmatrix} y_{k1} \\ \vdots \\ y_{kk} \end{bmatrix}
= - \begin{bmatrix} r_1 \\ \vdots \\ r_k \end{bmatrix}.
\]

Show that if

\[
L =
\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 \\
y_{11} & 1 & 0 & 0 & \cdots & 0 \\
y_{22} & y_{21} & 1 & 0 & \cdots & 0 \\
y_{33} & y_{32} & y_{31} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
y_{n-1,n-1} & y_{n-1,n-2} & y_{n-1,n-3} & \cdots & y_{n-1,1} & 1
\end{bmatrix}
\]

then $L T_n L^T = \mathrm{diag}(1, \beta_1, \ldots, \beta_{n-1})$ where
$\beta_k = 1 + [r^{(k)}]^T y^{(k)}$. Thus, the Durbin algorithm can be thought of as a
fast method for computing the $LDL^T$ factorization of $T_n^{-1}$.


This chapter is primarily concerned with the least squares solution of
overdetermined systems of equations, i.e., the minimization of
$\| Ax - b \|_2$ where $A \in \mathbb{R}^{m \times n}$ with $m \ge n$ and
$b \in \mathbb{R}^m$. The most reliable solution procedures for this problem
involve the reduction of A to various canonical forms via orthogonal
transformations. Householder reflections and Givens rotations are central
to this process and we begin the chapter with a discussion of these
important transformations. In §5.2 we discuss the computation of the
factorization A = QR where Q is orthogonal and R is upper triangular. This
amounts to finding an orthonormal basis for ran(A). The QR factorization
can be used to solve the full rank least squares problem as we show in
§5.3. The technique is compared with the method of normal equations after
a perturbation theory is developed. In §5.4 and §5.5 we consider methods
for handling the difficult situation when A is rank deficient (or nearly
so). QR with column pivoting and the SVD are featured. In §5.6 we discuss
several steps that can be taken to improve the quality of a computed
least squares solution. Some remarks about square and underdetermined
systems are offered in §5.7.


Before You Begin 


Chapters 1, 2, and 3 and §§4.1-4.3 are assumed. Within this chapter 
there are the following dependencies: 


§5.6
↑
§5.1 → §5.2 → §5.3 → §5.4 → §5.5
↓
§5.7


Complementary references include Lawson and Hanson (1974), Farebrother
(1987), and Björck (1996). See also Stewart (1973), Hager (1988), Stewart
and Sun (1990), Watkins (1991), Gill, Murray, and Wright (1991), Higham
(1996), Trefethen and Bau (1996), and Demmel (1996). Some MATLAB
functions important to this chapter are qr, svd, pinv, orth, rank, and the
"backslash" operator "\". LAPACK connections include


Householder times matrix 

Small n Householder times matrix 

Block Householder times matrix 

Computes J — V TV block reflector representation 
Generates a plane rotation 

Generates a vector of plane rotations 

Applies a vector of plane rotations to a vector pair 
Applies rotation sequence to 4 matrix 

Real rotation times complex vector pair 

Complex rotation (c real) times complex vector pair 
Complex rotation (s real) times complex vector pair 


A=QR 

Al z QR 

Q (factored form) times matrix (real case) 

Q (factored form) times matrix (compiex case) 


n pper trisogular 

QL = (orthogonal) (lower triangular) 

LQ = (lower trinngular)(orthogonal) 
A= RQ where A is upper trapezoidal 


Bidiagonalization of general matrix 
Generates the orthogonal transformations 
Bidiagoualization of band matrix 
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GELS F 
GELSS | SVD solution to min | AX — B IJ, 


5.1  Householder and Givens Matrices 


Recall that $Q \in \mathbb{R}^{n \times n}$ is orthogonal if $Q^TQ = QQ^T = I_n$. Orthogonal
matrices have an important role to play in least squares and eigenvalue
computations. In this section we introduce the key players in this game:
Householder reflections and Givens rotations.


5.1.1 A 2-by-2 Preview


It is instructive to examine the geometry associated with rotations and
reflections at the n = 2 level. A 2-by-2 orthogonal matrix Q is a rotation
if it has the form

$$ Q = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}. $$

If $y = Q^T x$, then y is obtained by rotating x counterclockwise through an
angle $\theta$.

A 2-by-2 orthogonal matrix Q is a reflection if it has the form

$$ Q = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ \sin(\theta) & -\cos(\theta) \end{bmatrix}. $$

If $y = Q^T x = Qx$, then y is obtained by reflecting the vector x across the
line defined by

$$ S = \mathrm{span}\left\{ \begin{bmatrix} \cos(\theta/2) \\ \sin(\theta/2) \end{bmatrix} \right\}. $$

Reflections and rotations are computationally attractive because they are
easily constructed and because they can be used to introduce zeros in a
vector by properly choosing the rotation angle or the reflection plane.


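Example 5.1.1 Suppose $x = [1, \sqrt{3}]^T$. If we set

$$ Q = \begin{bmatrix} \cos(-60^\circ) & \sin(-60^\circ) \\ -\sin(-60^\circ) & \cos(-60^\circ) \end{bmatrix} = \begin{bmatrix} 1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{bmatrix} $$

then $Q^T x = [2, 0]^T$. Thus, a rotation of $-60^\circ$ zeros the second component of x. If

$$ Q = \begin{bmatrix} \cos(60^\circ) & \sin(60^\circ) \\ \sin(60^\circ) & -\cos(60^\circ) \end{bmatrix} = \begin{bmatrix} 1/2 & \sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{bmatrix} $$

then $Q^T x = [2, 0]^T$. Thus, by reflecting x across the $30^\circ$ line (here $\theta/2 = 30^\circ$) we can zero its second
component.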
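The 2-by-2 rotation and reflection can be checked numerically. The following is a minimal sketch in plain Python; the helper names `rotation`, `reflection`, and `matvec_T` are illustrative and not part of the text. The reflection is built with $\theta = 60^\circ$ so that the mirror line $\mathrm{span}\{[\cos(\theta/2), \sin(\theta/2)]^T\}$ is the $30^\circ$ line.

```python
import math

def rotation(theta):
    """2-by-2 rotation Q = [[c, s], [-s, c]] (rows as lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def reflection(theta):
    """2-by-2 reflection Q = [[c, s], [s, -c]] across the line at angle theta/2."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [s, -c]]

def matvec_T(Q, x):
    """Return Q^T x for a 2-by-2 matrix Q."""
    return [Q[0][0] * x[0] + Q[1][0] * x[1],
            Q[0][1] * x[0] + Q[1][1] * x[1]]

x = [1.0, math.sqrt(3.0)]
y_rot = matvec_T(rotation(math.radians(-60.0)), x)   # rotate by -60 degrees
y_ref = matvec_T(reflection(math.radians(60.0)), x)  # reflect across the 30 degree line
print(y_rot, y_ref)
```

Both results agree with $[2, 0]^T$ up to roundoff.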
5.1.2 Householder Reflections

Let $v \in \mathbb{R}^n$ be nonzero. An n-by-n matrix P of the form

$$ P = I - \frac{2}{v^T v}\, v v^T \tag{5.1.1} $$

is called a Householder reflection. (Synonyms: Householder matrix,
Householder transformation.) The vector v is called a Householder vector. If a
vector x is multiplied by P, then it is reflected in the hyperplane $\mathrm{span}\{v\}^{\perp}$.
It is easy to verify that Householder matrices are symmetric and orthogonal.

Householder reflections are similar in two ways to Gauss transformations,
which we introduced in §3.2.1. They are rank-1 modifications of the
identity and they can be used to zero selected components of a vector. In
particular, suppose we are given $0 \ne x \in \mathbb{R}^n$ and want Px to be a multiple
of $e_1 = I_n(:,1)$. Note that

$$ Px = \left( I - \frac{2 v v^T}{v^T v} \right) x = x - \frac{2 v^T x}{v^T v}\, v $$

and $Px \in \mathrm{span}\{e_1\}$ imply $v \in \mathrm{span}\{x, e_1\}$. Setting $v = x + \alpha e_1$ gives

$$ v^T x = x^T x + \alpha x_1 $$

and

$$ v^T v = x^T x + 2\alpha x_1 + \alpha^2, $$

and therefore

$$ Px = \left( 1 - 2\,\frac{x^T x + \alpha x_1}{x^T x + 2\alpha x_1 + \alpha^2} \right) x - 2\alpha\, \frac{v^T x}{v^T v}\, e_1. $$

In order for the coefficient of x to be zero, we set $\alpha = \pm \| x \|_2$ for then

$$ v = x \pm \| x \|_2 e_1 \quad\Longrightarrow\quad Px = \left( I - 2\,\frac{v v^T}{v^T v} \right) x = \mp \| x \|_2 e_1. \tag{5.1.2} $$

It is this simple determination of v that makes the Householder reflection
so useful.
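Example 5.1.2 If $x = [3, 1, 5, 1]^T$ and $v = [9, 1, 5, 1]^T$, then

$$ P = I - 2\,\frac{v v^T}{v^T v} = \frac{1}{54} \begin{bmatrix} -27 & -9 & -45 & -9 \\ -9 & 53 & -5 & -1 \\ -45 & -5 & 29 & -5 \\ -9 & -1 & -5 & 53 \end{bmatrix} $$

has the property that $Px = [-6, 0, 0, 0]^T$.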
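The numbers in Example 5.1.2 can be reproduced without ever forming P, using $Px = x - (2 v^T x / v^T v)\, v$. The sketch below is plain Python; the function name `reflect` is an illustrative choice, and v is formed with the plus sign in (5.1.2).

```python
def reflect(v, x):
    """Return P x with P = I - 2 v v^T / (v^T v), without forming P."""
    vtx = sum(vi * xi for vi, xi in zip(v, x))
    vtv = sum(vi * vi for vi in v)
    scale = 2.0 * vtx / vtv
    return [xi - scale * vi for vi, xi in zip(v, x)]

x = [3.0, 1.0, 5.0, 1.0]
norm_x = sum(xi * xi for xi in x) ** 0.5   # ||x||_2 = 6
v = [x[0] + norm_x] + x[1:]                # v = x + ||x||_2 e_1 = [9, 1, 5, 1]
Px = reflect(v, x)
print(v, Px)
```

With the plus sign, Px is the negative multiple $-\| x \|_2 e_1$ of $e_1$, as (5.1.2) predicts.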
5.1.3 Computing the Householder Vector


There are a number of important practical details associated with the
determination of a Householder matrix, i.e., the determination of a Householder
vector. One concerns the choice of sign in the definition of v in (5.1.2).
Setting

$$ v_1 = x_1 - \| x \|_2 $$

has the nice property that Px is a positive multiple of $e_1$. But this recipe is
dangerous if x is close to a positive multiple of $e_1$ because severe cancellation
would occur. However, the formula

$$ v_1 = x_1 - \| x \|_2 = \frac{x_1^2 - \| x \|_2^2}{x_1 + \| x \|_2} = \frac{-(x_2^2 + \cdots + x_n^2)}{x_1 + \| x \|_2} $$

suggested by Parlett (1971) does not suffer from this defect in the $x_1 > 0$
case.

In practice, it is handy to normalize the Householder vector so that
v(1) = 1. This permits the storage of v(2:n) where the zeros have been
introduced in x, i.e., x(2:n). We refer to v(2:n) as the essential part of
the Householder vector. Recalling that $\beta = 2/v^T v$ and letting length(x)
specify vector dimension, we obtain the following encapsulation:


Algorithm 5.1.1 (Householder Vector) Given $x \in \mathbb{R}^n$, this function
computes $v \in \mathbb{R}^n$ with v(1) = 1 and $\beta \in \mathbb{R}$ such that $P = I_n - \beta v v^T$ is
orthogonal and $Px = \| x \|_2 e_1$.


    function: [v, β] = house(x)
        n = length(x)
        σ = x(2:n)^T x(2:n)
        v = [ 1 ; x(2:n) ]
        if σ = 0
            β = 0
        else
            μ = sqrt(x(1)^2 + σ)
            if x(1) <= 0
                v(1) = x(1) - μ
            else
                v(1) = -σ/(x(1) + μ)
            end
            β = 2v(1)^2/(σ + v(1)^2)
            v = v/v(1)
        end


This algorithm involves about 3n flops and renders a computed Householder
matrix that is orthogonal to machine precision, a concept discussed below.
A production version of Algorithm 5.1.1 may involve a preliminary scaling
of the x vector ($x \leftarrow x/\| x \|$) to avoid overflow.
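Algorithm 5.1.1 translates almost line for line into a short routine. The following is a sketch in plain Python with 0-based lists; `apply_house` is a hypothetical helper added only to check the result, and no overflow-avoiding scaling is attempted.

```python
import math

def house(x):
    """Householder vector as in Algorithm 5.1.1: returns (v, beta) with v[0] == 1."""
    sigma = sum(t * t for t in x[1:])
    if sigma == 0.0:
        # beta = 0 gives P = I, so P x = x; this equals ||x|| e_1 only when x[0] >= 0.
        return [1.0] + [0.0] * (len(x) - 1), 0.0
    mu = math.sqrt(x[0] * x[0] + sigma)
    if x[0] <= 0.0:
        v1 = x[0] - mu
    else:
        v1 = -sigma / (x[0] + mu)          # Parlett's cancellation-free formula
    beta = 2.0 * v1 * v1 / (sigma + v1 * v1)
    return [1.0] + [t / v1 for t in x[1:]], beta

def apply_house(v, beta, x):
    """Return (I - beta v v^T) x without forming the matrix."""
    t = beta * sum(vi * xi for vi, xi in zip(v, x))
    return [xi - t * vi for vi, xi in zip(v, x)]

x = [3.0, 1.0, 5.0, 1.0]
v, beta = house(x)
Px = apply_house(v, beta, x)
print(v, beta, Px)
```

For this x the routine takes the $x_1 > 0$ branch, and Px comes out as the positive multiple $\| x \|_2 e_1 = 6 e_1$.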


5.1.4 Applying Householder Matrices 


It is critical to exploit structure when applying a Householder reflection to
a matrix. If $A \in \mathbb{R}^{m \times n}$ and $P = I - \beta v v^T \in \mathbb{R}^{m \times m}$, then

$$ PA = (I - \beta v v^T) A = A - v w^T $$

where $w = \beta A^T v$. Likewise, if $P = I - \beta v v^T \in \mathbb{R}^{n \times n}$, then

$$ AP = A (I - \beta v v^T) = A - w v^T $$

where $w = \beta A v$. Thus, an m-by-n Householder update involves a
matrix-vector multiplication and an outer product update. It requires 4mn flops.
Failure to recognize this and to treat P as a general matrix increases work
by an order of magnitude. Householder updates never entail the explicit
formation of the Householder matrix.

Both of the above Householder updates can be implemented in a way
that exploits the fact that v(1) = 1. This feature can be important in the
computation of PA when m is small and in the computation of AP when
n is small.

As an example of a Householder matrix update, suppose we want to
overwrite $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) with $B = Q^T A$ where Q is an orthogonal
matrix chosen so that B(j+1:m, j) = 0 for some j that satisfies $1 \le j \le n$.
In addition, suppose A(j:m, 1:j-1) = 0 and that we want to store the
essential part of the Householder vector in A(j+1:m, j). The following
instructions accomplish this task:


    [v, β] = house(A(j:m, j))
    A(j:m, j:n) = (I_{m-j+1} - β v v^T) A(j:m, j:n)
    A(j+1:m, j) = v(2:m-j+1)


From the computational point of view, we have applied an order m - j + 1
Householder matrix to the bottom m - j + 1 rows of A. However,
mathematically we have also applied the m-by-m Householder matrix

$$ \tilde{P} = \begin{bmatrix} I_{j-1} & 0 \\ 0 & P \end{bmatrix} = I_m - \beta \tilde{v} \tilde{v}^T, \qquad \tilde{v} = \begin{bmatrix} 0 \\ v \end{bmatrix} $$

to A in its entirety. Regardless, the "essential" part of the Householder
vector can be recorded in the zeroed portion of A.
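The update $PA = A - v w^T$ with $w = \beta A^T v$ can be sketched as follows. This is plain Python on a list-of-rows matrix; the function name `house_left_update` and the sample matrix are illustrative, and the values of v and $\beta$ are the ones Algorithm 5.1.1 produces for the first column $[3, 1, 5, 1]^T$.

```python
def house_left_update(A, v, beta):
    """Overwrite A (m-by-n, list of rows) with (I - beta v v^T) A.

    Uses one matrix-vector product w = beta A^T v and one outer product
    update A - v w^T; the m-by-m matrix P is never formed.
    """
    m, n = len(A), len(A[0])
    w = [beta * sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
    for i in range(m):
        for j in range(n):
            A[i][j] -= v[i] * w[j]
    return A

A = [[3.0, 1.0],
     [1.0, 2.0],
     [5.0, 0.0],
     [1.0, 4.0]]
v = [1.0, -1.0 / 3.0, -5.0 / 3.0, -1.0 / 3.0]   # Householder vector for column 0
beta = 0.5
house_left_update(A, v, beta)
col0 = [row[0] for row in A]
col1_norm_sq = sum(row[1] ** 2 for row in A)     # 2-norm of column 1 is preserved
print(col0, col1_norm_sq)
```

The first column is reduced to a multiple of $e_1$ while the second column keeps its 2-norm, as expected of an orthogonal update.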
5.1.5 Roundoff Properties


The roundoff properties associated with Householder matrices are very
favorable. Wilkinson (1965, pp. 152-62) shows that house produces a
Householder vector $\hat{v}$ very near the exact v. If $\hat{P} = I - 2\hat{v}\hat{v}^T/\hat{v}^T\hat{v}$ then

$$ \| \hat{P} - P \|_2 = O(u) $$

meaning that $\hat{P}$ is orthogonal to machine precision. Moreover, the
computed updates with $\hat{P}$ are close to the exact updates with P:

$$ fl(\hat{P} A) = P(A + E), \qquad \| E \|_2 = O(u \| A \|_2) $$

$$ fl(A \hat{P}) = (A + E)P, \qquad \| E \|_2 = O(u \| A \|_2) $$


5.1.6 Factored Form Representation


Many Householder based factorization algorithms that are presented in the
following sections compute products of Householder matrices

$$ Q = Q_1 Q_2 \cdots Q_r, \qquad Q_j = I - \beta_j v^{(j)} v^{(j)T} \tag{5.1.3} $$

where $r \le n$ and each $v^{(j)}$ has the form

$$ v^{(j)} = (\underbrace{0, 0, \ldots, 0}_{j-1},\ 1,\ v^{(j)}_{j+1}, \ldots, v^{(j)}_n)^T. $$

It is usually not necessary to compute Q explicitly even if it is involved in
subsequent calculations. For example, if $C \in \mathbb{R}^{n \times q}$ and we wish to compute
$Q^T C$, then we merely execute the loop


    for j = 1:r
        C = Q_j C
    end


The storage of the Householder vectors $v^{(1)}, \ldots, v^{(r)}$ and the corresponding
$\beta_j$ (if convenient) amounts to a factored form representation of Q. To
illustrate the economies of the factored form representation, suppose that
we have an array A and that A(j+1:n, j) houses $v^{(j)}(j+1:n)$, the essential
part of the jth Householder vector. The overwriting of $C \in \mathbb{R}^{n \times q}$ with
$Q^T C$ can then be implemented as follows:


    for j = 1:r
        v(j:n) = [ 1 ; A(j+1:n, j) ]                              (5.1.4)
        C(j:n, :) = (I - β_j v(j:n) v(j:n)^T) C(j:n, :)
    end


This involves about 2qr(2n - r) flops. If Q is explicitly represented as an
n-by-n matrix, $Q^T C$ would involve $2n^2 q$ flops.

Of course, in some applications, it is necessary to explicitly form Q
(or parts of it). Two possible algorithms for computing the Householder
product matrix Q in (5.1.3) are forward accumulation,


    Q = I_n
    for j = 1:r
        Q = Q Q_j
    end


and backward accumulation,


    Q = I_n
    for j = r:-1:1
        Q = Q_j Q
    end


Recall that the leading (j - 1)-by-(j - 1) portion of $Q_j$ is the identity. Thus,
at the beginning of backward accumulation, Q is "mostly the identity" and
it gradually becomes full as the iteration progresses. This pattern can be
exploited to reduce the number of required flops. In contrast, Q is full
in forward accumulation after the first step. For this reason, backward
accumulation is cheaper and the strategy of choice:


    Q = I_n
    for j = r:-1:1
        v(j:n) = [ 1 ; A(j+1:n, j) ]                              (5.1.5)
        Q(j:n, j:n) = (I - β_j v(j:n) v(j:n)^T) Q(j:n, j:n)
    end


This involves about $4(n^2 r - n r^2 + r^3/3)$ flops.
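The loops (5.1.4) and (5.1.5) can be sketched in plain Python as follows. For simplicity the full Householder vectors are kept in a list `V` rather than packed into an array A, which is an assumption of this sketch; the function names are likewise illustrative. The two vectors used are those of Example 5.1.3, for which $\beta_1 = \beta_2 = 1$.

```python
def apply_qt(V, betas, C):
    """Overwrite C with Q^T C = Q_r ... Q_1 C, where Q_j = I - betas[j] V[j] V[j]^T."""
    n, q = len(C), len(C[0])
    for j, (v, beta) in enumerate(zip(V, betas)):
        for col in range(q):
            # v[i] == 0 for i < j, so only rows j..n-1 are touched
            t = beta * sum(v[i] * C[i][col] for i in range(j, n))
            for i in range(j, n):
                C[i][col] -= t * v[i]
    return C

def backward_accumulate(V, betas, n):
    """Form Q = Q_1 ... Q_r explicitly by backward accumulation, as in (5.1.5)."""
    Q = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    for j in range(len(V) - 1, -1, -1):
        v, beta = V[j], betas[j]
        for col in range(j, n):
            t = beta * sum(v[i] * Q[i][col] for i in range(j, n))
            for i in range(j, n):
                Q[i][col] -= t * v[i]
    return Q

V = [[1.0, 0.6, 0.0, 0.8], [0.0, 1.0, 0.8, 0.6]]
betas = [1.0, 1.0]
Q = backward_accumulate(V, betas, 4)
QT = apply_qt(V, betas, [[1.0 if i == k else 0.0 for k in range(4)] for i in range(4)])
gram = [[sum(Q[k][i] * Q[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
print(Q)
```

Applying the factored form to the identity reproduces $Q^T$, and the accumulated Q is orthogonal to roundoff.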


5.1.7 A Block Representation 


Suppose $Q = Q_1 \cdots Q_r$ is a product of n-by-n Householder matrices as
in (5.1.3). Since each $Q_j$ is a rank-one modification of the identity, it
follows from the structure of the Householder vectors that Q is a rank-r
modification of the identity and can be written in the form

$$ Q = I + W Y^T \tag{5.1.6} $$

where W and Y are n-by-r matrices. The key to computing the block
representation (5.1.6) is the following lemma.

Lemma 5.1.1 Suppose $Q = I + W Y^T$ is an n-by-n orthogonal matrix with
$W, Y \in \mathbb{R}^{n \times j}$. If $P = I - \beta v v^T$ with $v \in \mathbb{R}^n$ and $z = -\beta Q v$, then

$$ Q_+ = QP = I + W_+ Y_+^T $$

where $W_+ = [\, W \ z \,]$ and $Y_+ = [\, Y \ v \,]$ are each n-by-(j + 1).

Proof.

$$ QP = (I + W Y^T)(I - \beta v v^T) = I + W Y^T - \beta Q v v^T = I + W Y^T + z v^T = I + [\, W \ z \,][\, Y \ v \,]^T \qquad \Box $$

By repeatedly applying the lemma, we can generate the block
representation of Q in (5.1.3) from the factored form representation as follows:


Algorithm 5.1.2 Suppose $Q = Q_1 \cdots Q_r$ is a product of n-by-n
Householder matrices as described in (5.1.3). This algorithm computes matrices
$W, Y \in \mathbb{R}^{n \times r}$ such that $Q = I + W Y^T$.


    Y = v^(1)
    W = -β_1 v^(1)
    for j = 2:r
        z = -β_j (I + W Y^T) v^(j)
        W = [W z]
        Y = [Y v^(j)]
    end


This algorithm involves about $2r^2 n - 2r^3/3$ flops if the zeros in the $v^{(j)}$ are
exploited. Note that Y is merely the matrix of Householder vectors and is
therefore unit lower triangular. Clearly, the central task in the generation
of the WY representation (5.1.6) is the computation of the W matrix.

The block representation for products of Householder matrices is
attractive in situations where Q must be applied to a matrix. Suppose $C \in \mathbb{R}^{n \times q}$.
It follows that the operation

$$ C \leftarrow Q^T C = (I + W Y^T)^T C = C + Y (W^T C) $$

is rich in level-3 operations. On the other hand, if Q is in factored form,
$Q^T C$ is just rich in the level-2 operations of matrix-vector multiplication
and outer product updates. Of course, in this context the distinction
between level-2 and level-3 diminishes as C gets narrower.

We mention that the "WY" representation is not a generalized
Householder transformation from the geometric point of view. True block
reflectors have the form $Q = I - 2VV^T$ where $V \in \mathbb{R}^{n \times r}$ satisfies $V^T V = I_r$.
See Schreiber and Parlett (1987) and also Schreiber and Van Loan (1989).


Example 5.1.3 If n = 4, r = 2, and $[1, .6, 0, .8]^T$ and $[0, 1, .8, .6]^T$ are the
Householder vectors associated with $Q_1$ and $Q_2$ respectively, then

$$ Q_1 Q_2 = I_4 + W Y^T = I_4 + \begin{bmatrix} -1 & 1.080 \\ -.6 & -.352 \\ 0 & -.800 \\ -.8 & .264 \end{bmatrix} \begin{bmatrix} 1 & .6 & 0 & .8 \\ 0 & 1 & .8 & .6 \end{bmatrix}. $$
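Algorithm 5.1.2 can be sketched in a few lines. In the plain-Python version below, W and Y are stored as lists of columns, which is a convenience of this sketch and not the text's storage scheme; the zeros in the Householder vectors are not exploited. The input is the data of Example 5.1.3, where $\beta_1 = \beta_2 = 2/v^T v = 1$.

```python
def wy_from_householders(V, betas):
    """Return (W, Y) as lists of columns with Q_1 ... Q_r = I + W Y^T (Algorithm 5.1.2)."""
    n = len(V[0])
    Y = [list(V[0])]
    W = [[-betas[0] * t for t in V[0]]]
    for v, beta in zip(V[1:], betas[1:]):
        # z = -beta (I + W Y^T) v = -beta (v + sum_k W_k (Y_k^T v))
        coeffs = [sum(yk[i] * v[i] for i in range(n)) for yk in Y]
        z = [-beta * (v[i] + sum(wk[i] * ck for wk, ck in zip(W, coeffs)))
             for i in range(n)]
        W.append(z)
        Y.append(list(v))
    return W, Y

V = [[1.0, 0.6, 0.0, 0.8], [0.0, 1.0, 0.8, 0.6]]
betas = [1.0, 1.0]
W, Y = wy_from_householders(V, betas)
print(W)
```

The second column of W agrees with the one displayed in Example 5.1.3.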


5.1.8 Givens Rotations 


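Householder reflections are exceedingly useful for introducing zeros on a
grand scale, e.g., the annihilation of all but the first component of a
vector. However, in calculations where it is necessary to zero elements more
selectively, Givens rotations are the transformation of choice. These are
rank-two corrections to the identity of the form

$$ G(i,k,\theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \begin{matrix} \\ \\ i \\ \\ k \\ \\ \\ \end{matrix} \tag{5.1.7} $$

where $c = \cos(\theta)$ and $s = \sin(\theta)$ for some $\theta$, and the entries c, s, -s, c
occupy rows and columns i and k. Givens rotations are clearly
orthogonal.

Premultiplication by $G(i,k,\theta)^T$ amounts to a counterclockwise rotation
of $\theta$ radians in the (i,k) coordinate plane. Indeed, if $x \in \mathbb{R}^n$ and
$y = G(i,k,\theta)^T x$, then

$$ y_j = \begin{cases} c x_i - s x_k, & j = i \\ s x_i + c x_k, & j = k \\ x_j, & j \ne i, k \end{cases} $$

From these formulae it is clear that we can force $y_k$ to be zero by setting

$$ c = \frac{x_i}{\sqrt{x_i^2 + x_k^2}}, \qquad s = \frac{-x_k}{\sqrt{x_i^2 + x_k^2}}. \tag{5.1.8} $$

Thus, it is a simple matter to zero a specified entry in a vector by using a
Givens rotation. In practice, there are better ways to compute c and s than
(5.1.8). The following algorithm, for example, guards against overflow.

Algorithm 5.1.3 Given scalars a and b, this function computes $c = \cos(\theta)$
and $s = \sin(\theta)$ so

$$ \begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix}. $$


    function: [c, s] = givens(a, b)
        if b = 0
            c = 1; s = 0
        else
            if |b| > |a|
                τ = -a/b; s = 1/sqrt(1 + τ^2); c = sτ
            else
                τ = -b/a; c = 1/sqrt(1 + τ^2); s = cτ
            end
        end


This algorithm requires 5 flops and a single square root. Note that it does
not compute $\theta$ and so it does not involve inverse trigonometric functions.


Example 5.1.4 If $x = [1, 2, 3, 4]^T$, $\cos(\theta) = 1/\sqrt{5}$, and $\sin(\theta) = -2/\sqrt{5}$, then
$G(2,4,\theta)^T x = [1, \sqrt{20}, 3, 0]^T$.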
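Algorithm 5.1.3 carries over directly to code. The sketch below is plain Python; the variable names `r` and `zero` are illustrative. Note that the algorithm fixes the sign of r by its branch structure, so for the pair (2, 4) it returns the rotation with $r = -\sqrt{20}$, whereas Example 5.1.4 uses the equally valid pair (c, s) of opposite sign.

```python
import math

def givens(a, b):
    """Return (c, s) with [[c, s], [-s, c]]^T [a, b]^T = [r, 0]^T (Algorithm 5.1.3)."""
    if b == 0.0:
        return 1.0, 0.0
    if abs(b) > abs(a):
        tau = -a / b
        s = 1.0 / math.sqrt(1.0 + tau * tau)
        c = s * tau
    else:
        tau = -b / a
        c = 1.0 / math.sqrt(1.0 + tau * tau)
        s = c * tau
    return c, s

a, b = 2.0, 4.0
c, s = givens(a, b)
r = c * a - s * b        # first component of the rotated pair
zero = s * a + c * b     # second component, annihilated by construction
print(c, s, r, zero)
```

Because only the ratio a/b or b/a is squared, intermediate quantities stay of modest size even when a and b are large.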


5.1.9 Applying Givens Rotations 


It is critical that the simple structure of a Givens rotation matrix be
exploited when it is involved in a matrix multiplication. Suppose $A \in \mathbb{R}^{m \times n}$,
$c = \cos(\theta)$, and $s = \sin(\theta)$. If $G(i,k,\theta) \in \mathbb{R}^{m \times m}$, then the update
$A \leftarrow G(i,k,\theta)^T A$ affects just two rows of A,

$$ A([i, k], :) = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T A([i, k], :) $$

and requires just 6n flops:


    for j = 1:n
        τ_1 = A(i, j)
        τ_2 = A(k, j)
        A(i, j) = cτ_1 - sτ_2
        A(k, j) = sτ_1 + cτ_2
    end


Likewise, if $G(i,k,\theta) \in \mathbb{R}^{n \times n}$, then the update $A \leftarrow A\,G(i,k,\theta)$ affects just
two columns of A,

$$ A(:, [i, k]) = A(:, [i, k]) \begin{bmatrix} c & s \\ -s & c \end{bmatrix} $$

and requires just 6m flops:


    for j = 1:m
        τ_1 = A(j, i)
        τ_2 = A(j, k)
        A(j, i) = cτ_1 - sτ_2
        A(j, k) = sτ_1 + cτ_2
    end
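The two update loops can be sketched as follows in plain Python with 0-based indices; the function names `rot_rows` and `rot_cols` are illustrative. The demonstration reproduces Example 5.1.4 by treating x as a 4-by-1 matrix, so that rows 2 and 4 of the text become indices 1 and 3.

```python
import math

def rot_rows(A, i, k, c, s):
    """Overwrite A with G(i,k,theta)^T A; only rows i and k change."""
    for j in range(len(A[0])):
        t1, t2 = A[i][j], A[k][j]
        A[i][j] = c * t1 - s * t2
        A[k][j] = s * t1 + c * t2
    return A

def rot_cols(A, i, k, c, s):
    """Overwrite A with A G(i,k,theta); only columns i and k change."""
    for row in A:
        t1, t2 = row[i], row[k]
        row[i] = c * t1 - s * t2
        row[k] = s * t1 + c * t2
    return A

c, s = 1.0 / math.sqrt(5.0), -2.0 / math.sqrt(5.0)
X = [[1.0], [2.0], [3.0], [4.0]]
rot_rows(X, 1, 3, c, s)
y = [row[0] for row in X]

R = [[1.0, 2.0, 3.0, 4.0]]
rot_cols(R, 1, 3, c, s)          # same arithmetic applied to a row vector
print(y, R[0])
```

Each routine touches only two rows or two columns, which is the source of the 6n and 6m flop counts.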


5.1.10 Roundoff Properties 


The numerical properties of Givens rotations are as favorable as those for
Householder reflections. In particular, it can be shown that the computed
$\hat{c}$ and $\hat{s}$ in givens satisfy

$$ \hat{c} = c(1 + \epsilon_c), \qquad \epsilon_c = O(u) $$

$$ \hat{s} = s(1 + \epsilon_s), \qquad \epsilon_s = O(u). $$

If $\hat{c}$ and $\hat{s}$ are subsequently used in a Givens update, then the computed
update is the exact update of a nearby matrix:

$$ fl[\hat{G}(i,k,\theta)^T A] = G(i,k,\theta)^T (A + E), \qquad \| E \|_2 \approx u \| A \|_2 $$

$$ fl[A \hat{G}(i,k,\theta)] = (A + E)\, G(i,k,\theta), \qquad \| E \|_2 \approx u \| A \|_2. $$

A detailed error analysis of Givens rotations may be found in Wilkinson
(1965, pp. 131-39).


5.1.11 Representing Products of Givens Rotations 


Suppose $Q = G_1 \cdots G_t$ is a product of Givens rotations. As we have seen in
connection with Householder reflections, it is more economical to keep the
orthogonal matrix Q in factored form than to compute explicitly the
product of the rotations. Using a technique demonstrated by Stewart (1976),
it is possible to do this in a very compact way. The idea is to associate a
single floating point number $\rho$ with each rotation. Specifically, if

$$ Z = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \qquad c^2 + s^2 = 1 $$

then we define the scalar $\rho$ by


    if c = 0
        ρ = 1
    elseif |s| < |c|
        ρ = sign(c)s/2                                    (5.1.9)
    else
        ρ = 2sign(s)/c
    end


Essentially, this amounts to storing s/2 if the sine is smaller and 2/c if the
cosine is smaller. With this encoding, it is possible to reconstruct $\pm Z$ as
follows:


    if ρ = 1
        c = 0; s = 1
    elseif |ρ| < 1
        s = 2ρ; c = sqrt(1 - s^2)                         (5.1.10)
    else
        c = 2/ρ; s = sqrt(1 - c^2)
    end


That $-Z$ may be generated is usually of no consequence for if Z zeros a
particular matrix entry, so does $-Z$. The reason for essentially storing the
smaller of c and s is that the formula $\sqrt{1 - x^2}$ renders poor results if x is
near unity. More details may be found in Stewart (1976). Of course, to
"reconstruct" $G(i,k,\theta)$ we need i and k in addition to the associated $\rho$.
This usually poses no difficulty as we discuss in §5.2.3.
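The encoding (5.1.9) and decoding (5.1.10) can be sketched as a pair of small functions. This is plain Python; the names `encode`, `decode`, and `roundtrip_ok` are illustrative, and `math.copysign` stands in for the sign function.

```python
import math

def encode(c, s):
    """Pack a rotation (c, s), c^2 + s^2 = 1, into one number rho as in (5.1.9)."""
    if c == 0.0:
        return 1.0
    if abs(s) < abs(c):
        return math.copysign(1.0, c) * s / 2.0
    return 2.0 * math.copysign(1.0, s) / c

def decode(rho):
    """Recover (c, s) up to a common sign from rho as in (5.1.10)."""
    if rho == 1.0:
        return 0.0, 1.0
    if abs(rho) < 1.0:
        s = 2.0 * rho
        return math.sqrt(1.0 - s * s), s
    c = 2.0 / rho
    return c, math.sqrt(1.0 - c * c)

def roundtrip_ok(c, s, tol=1e-12):
    """True if decode(encode(c, s)) equals (c, s) or (-c, -s) to within tol."""
    c2, s2 = decode(encode(c, s))
    same = abs(c2 - c) < tol and abs(s2 - s) < tol
    flipped = abs(c2 + c) < tol and abs(s2 + s) < tol
    return same or flipped

print(encode(0.8, 0.6), decode(encode(0.8, 0.6)))
print(encode(0.6, -0.8), decode(encode(0.6, -0.8)))
```

In the second call the decoded pair is (-0.6, 0.8), i.e., $-Z$ rather than Z, which is the sign ambiguity the text describes.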


5.1.12 Error Propagation 


We offer some remarks about the propagation of roundoff error in
algorithms that involve sequences of Householder/Givens updates. To be
precise, suppose $A = A_0 \in \mathbb{R}^{m \times n}$ is given and that matrices $A_1, \ldots, A_p = B$
are generated via the formula

$$ A_k = fl(\hat{Q}_k A_{k-1} \hat{Z}_k), \qquad k = 1:p. $$

Assume that the above Householder and Givens algorithms are used for
both the generation and application of the $\hat{Q}_k$ and $\hat{Z}_k$. Let $Q_k$ and $Z_k$ be
the orthogonal matrices that would be produced in the absence of roundoff.
It can be shown that

$$ B = (Q_p \cdots Q_1)(A + E)(Z_1 \cdots Z_p), \tag{5.1.11} $$

where $\| E \|_2 \le c\, u \| A \|_2$ and c is a constant that depends mildly on n, m,
and p. In plain English, B is an exact orthogonal update of a matrix near
to A.


5.1.13 Fast Givens Transformations 


The ability to introduce zeros in a selective fashion makes Givens rotations
an important zeroing tool in certain structured problems. This has led to
the development of "fast Givens" procedures. The fast Givens idea amounts
to a clever representation of Q when Q is the product of Givens rotations.
In particular, Q is represented by a matrix pair (M, D) where $M^T M = D =
\mathrm{diag}(d_i)$ and each $d_i$ is positive. The matrices Q, M, and D are connected
through the formula

$$ Q = M D^{-1/2} = M\, \mathrm{diag}(1/\sqrt{d_i}). $$

Note that $(M D^{-1/2})^T (M D^{-1/2}) = D^{-1/2} D D^{-1/2} = I$ and so the
matrix $M D^{-1/2}$ is orthogonal. Moreover, if F is an n-by-n matrix with
$F^T D F = D_{new}$ diagonal, then $M_{new}^T M_{new} = D_{new}$ where $M_{new} = MF$.
Thus, it is possible to update the fast Givens representation (M, D) to
obtain $(M_{new}, D_{new})$. For this idea to be of practical interest, we must show
how to give F zeroing capabilities subject to the constraint that it "keeps"
D diagonal.

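The details are best explained at the 2-by-2 level. Let $x = [x_1\ x_2]^T$ and
$D = \mathrm{diag}(d_1, d_2)$ be given and assume that $d_1$ and $d_2$ are positive. Define

$$ M_1 = \begin{bmatrix} \beta_1 & 1 \\ 1 & \alpha_1 \end{bmatrix} \tag{5.1.12} $$

and observe that

$$ M_1^T x = \begin{bmatrix} \beta_1 x_1 + x_2 \\ x_1 + \alpha_1 x_2 \end{bmatrix} $$

and

$$ M_1^T D M_1 = \begin{bmatrix} d_2 + \beta_1^2 d_1 & d_1 \beta_1 + d_2 \alpha_1 \\ d_1 \beta_1 + d_2 \alpha_1 & d_1 + \alpha_1^2 d_2 \end{bmatrix} \equiv D_1. $$

If $x_2 \ne 0$, $\alpha_1 = -x_1/x_2$, and $\beta_1 = -\alpha_1 d_2/d_1$, then

$$ M_1^T x = \begin{bmatrix} x_2 (1 + \gamma_1) \\ 0 \end{bmatrix}, \qquad M_1^T D M_1 = \begin{bmatrix} d_2 (1 + \gamma_1) & 0 \\ 0 & d_1 (1 + \gamma_1) \end{bmatrix} $$

where $\gamma_1 = -\alpha_1 \beta_1 = (d_2/d_1)(x_1/x_2)^2$.

Analogously, if we assume $x_1 \ne 0$ and define $M_2$ by

$$ M_2 = \begin{bmatrix} 1 & \alpha_2 \\ \beta_2 & 1 \end{bmatrix} \tag{5.1.13} $$

where $\alpha_2 = -x_2/x_1$ and $\beta_2 = -(d_1/d_2)\alpha_2$, then

$$ M_2^T x = \begin{bmatrix} x_1 (1 + \gamma_2) \\ 0 \end{bmatrix} $$

and

$$ M_2^T D M_2 = \begin{bmatrix} d_1 (1 + \gamma_2) & 0 \\ 0 & d_2 (1 + \gamma_2) \end{bmatrix} \equiv D_2, $$

where $\gamma_2 = -\alpha_2 \beta_2 = (d_1/d_2)(x_2/x_1)^2$.

It is easy to show that for either i = 1 or 2, the matrix $J = D^{1/2} M_i D_i^{-1/2}$
is orthogonal and that it is designed so that the second component of
$J^T (D^{-1/2} x)$ is zero. (J may actually be a reflection and thus it is
half-correct to use the popular term "fast Givens.")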

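Notice that the $\gamma_i$ satisfy $\gamma_1 \gamma_2 = 1$. Thus, we can always select $M_i$ in
the above so that the "growth factor" $(1 + \gamma_i)$ is bounded by 2. Matrices
of the form

$$ M_1 = \begin{bmatrix} \beta_1 & 1 \\ 1 & \alpha_1 \end{bmatrix}, \qquad M_2 = \begin{bmatrix} 1 & \alpha_2 \\ \beta_2 & 1 \end{bmatrix} $$

that satisfy $-1 \le \alpha_i \beta_i \le 0$ are 2-by-2 fast Givens transformations. Notice
that premultiplication by a fast Givens transformation involves half the
number of multiplies as premultiplication by an "ordinary" Givens
transformation. Also, the zeroing is carried out without an explicit square root.

In the n-by-n case, everything "scales up" as with ordinary Givens
rotations. The "type 1" transformations have the form

$$ F(i,k,\alpha,\beta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & \beta & \cdots & 1 & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & 1 & \cdots & \alpha & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \begin{matrix} \\ \\ i \\ \\ k \\ \\ \\ \end{matrix} \tag{5.1.14} $$

while the "type 2" transformations are structured as follows:

$$ F(i,k,\alpha,\beta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & 1 & \cdots & \alpha & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & \beta & \cdots & 1 & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \begin{matrix} \\ \\ i \\ \\ k \\ \\ \\ \end{matrix} \tag{5.1.15} $$

In both cases the displayed nontrivial entries occupy rows and columns i and k.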

Encapsulating all this we obtain


Algorithm 5.1.4 Given $x \in \mathbb{R}^2$ and positive $d \in \mathbb{R}^2$, the following
algorithm computes a 2-by-2 fast Givens transformation M such that the
second component of $M^T x$ is zero and $M^T D M = D_1$ is diagonal where D
= $\mathrm{diag}(d_1, d_2)$. If type = 1 then M has the form (5.1.12) while if type = 2
then M has the form (5.1.13). The diagonal elements of $D_1$ overwrite d.


    function: [α, β, type] = fast.givens(x, d)
        if x(2) ≠ 0
            α = -x(1)/x(2); β = -αd(2)/d(1); γ = -αβ
            if γ <= 1
                type = 1
                τ = d(1); d(1) = (1 + γ)d(2); d(2) = (1 + γ)τ
            else
                type = 2
                α = 1/α; β = 1/β; γ = 1/γ
                d(1) = (1 + γ)d(1); d(2) = (1 + γ)d(2)
            end
        else
            type = 2
            α = 0; β = 0
        end

The application of fast Givens transformations is analogous to that for
ordinary Givens transformations. Even with the appropriate type of
transformation used, the growth factor $1 + \gamma$ may still be as large as two. Thus,
$2^s$ growth can occur in the entries of D and M after s updates. This means
that the diagonal D must be monitored during a fast Givens procedure to
avoid overflow. See Anda and Park (1994) for how to do this efficiently.

Nevertheless, element growth in M and D is controlled because at all
times we have $M D^{-1/2}$ orthogonal. The roundoff properties of a fast Givens
procedure are what we would expect of a Givens matrix technique. For
example, if we computed $\hat{Q} = fl(\hat{M} \hat{D}^{-1/2})$ where $\hat{M}$ and $\hat{D}$ are the computed
M and D, then $\hat{Q}$ is orthogonal to working precision: $\| \hat{Q}^T \hat{Q} - I \|_2 \approx u$.
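A sketch of Algorithm 5.1.4 in plain Python follows. Unlike the text's version, which overwrites d, this sketch returns the updated diagonal as a fourth value; that choice, the helper `build_m`, and the sample data are assumptions made for illustration.

```python
def fast_givens(x, d):
    """Return (alpha, beta, type, d_new) for a 2-by-2 fast Givens transformation."""
    d = list(d)
    if x[1] != 0.0:
        alpha = -x[0] / x[1]
        beta = -alpha * d[1] / d[0]
        gamma = -alpha * beta
        if gamma <= 1.0:
            typ = 1
            d[0], d[1] = (1.0 + gamma) * d[1], (1.0 + gamma) * d[0]
        else:
            typ = 2
            alpha, beta, gamma = 1.0 / alpha, 1.0 / beta, 1.0 / gamma
            d[0], d[1] = (1.0 + gamma) * d[0], (1.0 + gamma) * d[1]
    else:
        typ, alpha, beta = 2, 0.0, 0.0
    return alpha, beta, typ, d

def build_m(alpha, beta, typ):
    """Form M as in (5.1.12) for type 1 or (5.1.13) for type 2."""
    if typ == 1:
        return [[beta, 1.0], [1.0, alpha]]
    return [[1.0, alpha], [beta, 1.0]]

x, d = [2.0, 1.0], [1.0, 4.0]
alpha, beta, typ, d_new = fast_givens(x, d)
M = build_m(alpha, beta, typ)
mtx = [M[0][0] * x[0] + M[1][0] * x[1],          # M^T x
       M[0][1] * x[0] + M[1][1] * x[1]]
mtdm = [[sum(M[k][i] * d[k] * M[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]                        # M^T D M
print(typ, mtx, mtdm, d_new)
```

For this data $\gamma_1 = 16 > 1$, so the type 2 form is selected and the growth factor is $1 + 1/16$. No square root is taken anywhere.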


Problems 


P5.1.1 Execute house with $x = [1, 7, 2, 3, -1]^T$.

P5.1.2 Let x and y be nonzero vectors in $\mathbb{R}^n$. Give an algorithm for determining a
Householder matrix P such that Px is a multiple of y.

P5.1.3 Suppose $x \in \mathbb{C}^n$ and that $x_1 = |x_1| e^{i\theta}$ with $\theta \in \mathbb{R}$. Assume $x \ne 0$ and
define $u = x + e^{i\theta} \| x \|_2 e_1$. Show that $P = I - 2uu^H/u^H u$ is unitary and that
$Px = -e^{i\theta} \| x \|_2 e_1$.

P5.1.4 Use Householder matrices to show that $\det(I + xy^T) = 1 + x^T y$ where x and y
are given n-vectors.

P5.1.5 Suppose $x \in \mathbb{C}^2$. Give an algorithm for determining a unitary matrix of the
form

$$ Q = \begin{bmatrix} c & s \\ -\bar{s} & c \end{bmatrix}, \qquad c \in \mathbb{R}, \quad c^2 + |s|^2 = 1 $$

such that the second component of $Q^H x$ is zero.

P5.1.6 Suppose x and y are unit vectors in $\mathbb{R}^n$. Give an algorithm using Givens
transformations which computes an orthogonal Q such that $Q^T x = y$.
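P5.1.7 Determine $c = \cos(\theta)$ and $s = \sin(\theta)$ such that

$$ \begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T \begin{bmatrix} 5 \\ 12 \end{bmatrix} = \begin{bmatrix} 13 \\ 0 \end{bmatrix}. $$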


P5.1.8 Suppose that $Q = I + YTY^T$ is orthogonal where $Y \in \mathbb{R}^{n \times j}$ and $T \in \mathbb{R}^{j \times j}$ is
upper triangular. Show that if $Q_+ = QP$ where $P = I - 2vv^T/v^T v$ is a Householder
matrix, then $Q_+$ can be expressed in the form $Q_+ = I + Y_+ T_+ Y_+^T$ where $Y_+ \in \mathbb{R}^{n \times (j+1)}$
and $T_+ \in \mathbb{R}^{(j+1) \times (j+1)}$ is upper triangular.

P5.1.9 Give a detailed implementation of Algorithm 5.1.2 with the assumption that
$v^{(j)}(j+1:n)$, the essential part of the jth Householder vector, is stored in A(j+1:n, j).
Since Y is effectively represented in A, your procedure need only set up the W matrix.

P5.1.10 Show that if S is skew-symmetric ($S^T = -S$), then $Q = (I + S)(I - S)^{-1}$ is
orthogonal. (Q is called the Cayley transform of S.) Construct a rank-2 S so that if x
is a vector then Qx is zero except in the first component.

P5.1.11 Suppose $P \in \mathbb{R}^{n \times n}$ satisfies $\| P^T P - I_n \|_2 = \epsilon < 1$. Show that all the singular
values of P are in the interval $[1 - \epsilon, 1 + \epsilon]$ and that $\| P - UV^T \|_2 \le \epsilon$ where $P = U \Sigma V^T$
is the SVD of P.

P5.1.12 Suppose $A \in \mathbb{R}^{2 \times 2}$. Under what conditions is the closest rotation to A closer
than the closest reflection to A?


Notes and References for Sec. 5.1 
C.H. Bischof and C. Van Loan (1987). "The WY Representation for Products of
Householder Matrices," SIAM J. Sci. and Stat. Comp. 8, s2-s13.

SIAM J. Numer. Anal. 25, 189-205.

Givens rotations, named after W. Givens, are also referred to as Jacobi rotations. Jacobi
devised a symmetric eigenvalue algorithm based on these transformations in 1846. See
§8.4. The Givens rotation storage scheme discussed in the text is detailed in


G.W. Stewart (1976). "The Economical Storage of Plane Rotations," Numer. Math.
25, 137-38.


Fast Givens transformations are also referred to as "square-root-free" Givens
transformations. (Recall that a square root must ordinarily be computed during the formation
of a Givens transformation.) There are several ways fast Givens calculations can be
arranged. See


M. Gentleman (1973). "Least Squares Computations by Givens Transformations without
Square Roots," J. Inst. Math. Appl. 12, 329-36.
5.2 The QR Factorization


We now show how Householder and Givens transformations can be used to
compute various factorizations, beginning with the QR factorization. The
QR factorization of an m-by-n matrix A is given by

$$ A = QR $$

where $Q \in \mathbb{R}^{m \times m}$ is orthogonal and $R \in \mathbb{R}^{m \times n}$ is upper triangular. In this
section we assume $m \ge n$. We will see that if A has full column rank,
then the first n columns of Q form an orthonormal basis for ran(A). Thus,
calculation of the QR factorization is one way to compute an orthonormal
basis for a set of vectors. This computation can be arranged in several ways.
We give methods based on Householder, block Householder, Givens, and
fast Givens transformations. The Gram-Schmidt orthogonalization process
and a numerically more stable variant called modified Gram-Schmidt are
also discussed.
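5.2.1 Householder QR


We begin with a QR factorization method that utilizes Householder
transformations. The essence of the algorithm can be conveyed by a small
example. Suppose m = 6, n = 5, and assume that Householder matrices $H_1$
and $H_2$ have been computed so that

$$ H_2 H_1 A = \begin{bmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & \boxtimes & \times & \times \\ 0 & 0 & \boxtimes & \times & \times \\ 0 & 0 & \boxtimes & \times & \times \\ 0 & 0 & \boxtimes & \times & \times \end{bmatrix}. $$

Concentrating on the highlighted entries, we determine a Householder
matrix $\tilde{H}_3 \in \mathbb{R}^{4 \times 4}$ such that

$$ \tilde{H}_3 \begin{bmatrix} \boxtimes \\ \boxtimes \\ \boxtimes \\ \boxtimes \end{bmatrix} = \begin{bmatrix} \times \\ 0 \\ 0 \\ 0 \end{bmatrix}. $$

If $H_3 = \mathrm{diag}(I_2, \tilde{H}_3)$, then

$$ H_3 H_2 H_1 A = \begin{bmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times & \times \end{bmatrix}. $$

After n such steps we obtain an upper triangular $H_n H_{n-1} \cdots H_1 A = R$ and
so by setting $Q = H_1 \cdots H_n$ we obtain A = QR.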


Algorithm 5.2.1 (Householder QR) Given $A \in \mathbb{R}^{m \times n}$ with $m \ge n$,
the following algorithm finds Householder matrices $H_1, \ldots, H_n$ such that if
$Q = H_1 \cdots H_n$, then $Q^T A = R$ is upper triangular. The upper triangular
part of A is overwritten by the upper triangular part of R and components
j+1:m of the jth Householder vector are stored in A(j+1:m, j), j < m.


    for j = 1:n
        [v, β] = house(A(j:m, j))
        A(j:m, j:n) = (I_{m-j+1} - β v v^T) A(j:m, j:n)
        if j < m
            A(j+1:m, j) = v(2:m-j+1)
        end
    end


This algorithm requires $2n^2(m - n/3)$ flops.

To clarify how A is overwritten, if

$$ v^{(j)} = [\underbrace{0, \ldots, 0}_{j-1},\ 1,\ v^{(j)}_{j+1}, \ldots, v^{(j)}_m]^T $$

is the jth Householder vector, then upon completion

$$ A = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} & r_{15} \\ v_2^{(1)} & r_{22} & r_{23} & r_{24} & r_{25} \\ v_3^{(1)} & v_3^{(2)} & r_{33} & r_{34} & r_{35} \\ v_4^{(1)} & v_4^{(2)} & v_4^{(3)} & r_{44} & r_{45} \\ v_5^{(1)} & v_5^{(2)} & v_5^{(3)} & v_5^{(4)} & r_{55} \\ v_6^{(1)} & v_6^{(2)} & v_6^{(3)} & v_6^{(4)} & v_6^{(5)} \end{bmatrix}. $$

If the matrix $Q = H_1 \cdots H_n$ is required, then it can be accumulated using
(5.1.5). This accumulation requires $4(m^2 n - m n^2 + n^3/3)$ flops.

The computed upper triangular matrix $\hat{R}$ is the exact R for a nearby A
in the sense that $Z^T (A + E) = \hat{R}$ where Z is some exact orthogonal matrix
and $\| E \|_2 \approx u \| A \|_2$.
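Algorithm 5.2.1 can be sketched compactly in plain Python. The version below works on a copy of A stored as a list of rows and returns the $\beta_j$ in a separate list; both choices, and the function names, are assumptions of this sketch. The `house` helper repeats Algorithm 5.1.1 so that the block is self-contained.

```python
import math

def house(x):
    """Householder vector (Algorithm 5.1.1): returns (v, beta) with v[0] == 1."""
    sigma = sum(t * t for t in x[1:])
    if sigma == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0
    mu = math.sqrt(x[0] * x[0] + sigma)
    v1 = x[0] - mu if x[0] <= 0.0 else -sigma / (x[0] + mu)
    beta = 2.0 * v1 * v1 / (sigma + v1 * v1)
    return [1.0] + [t / v1 for t in x[1:]], beta

def householder_qr(A):
    """Return (F, betas): R in the upper triangle of F, essential vectors below it."""
    F = [[float(t) for t in row] for row in A]
    m, n = len(F), len(F[0])
    betas = []
    for j in range(n):
        v, beta = house([F[i][j] for i in range(j, m)])
        for col in range(j, n):
            w = beta * sum(v[i - j] * F[i][col] for i in range(j, m))
            for i in range(j, m):
                F[i][col] -= w * v[i - j]
        for i in range(j + 1, m):
            F[i][j] = v[i - j]          # essential part of the j-th Householder vector
        betas.append(beta)
    return F, betas

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
F, betas = householder_qr(A)
r11, r12, r22 = F[0][0], F[0][1], F[1][1]
print(F, betas)
```

Since Q is orthogonal, $R^T R = A^T A$, which gives a cheap consistency check on the computed triangle.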


5.2.2 Block Householder QR Factorization 


Algorithm 5.2.1 is rich in the level-2 operations of matrix-vector
multiplication and outer product updates. By reorganizing the computation
and using the block Householder representation discussed in §5.1.7 we can
obtain a level-3 procedure. The idea is to apply clusters of Householder
transformations that are represented in the WY form of §5.1.7.

A small example illustrates the main idea. Suppose n = 12 and that
the "blocking parameter" r has the value r = 3. The first step is to
generate Householders $H_1$, $H_2$, and $H_3$ as in Algorithm 5.2.1. However, unlike
Algorithm 5.2.1 where the $H_i$ are applied to all of A, we only apply $H_1$,
$H_2$, and $H_3$ to A(:, 1:3). After this is accomplished we generate the block
representation $H_1 H_2 H_3 = I + W_1 Y_1^T$ and then perform the level-3 update

$$ A(:, 4{:}12) = (I + W_1 Y_1^T)^T A(:, 4{:}12). $$

Next, we generate $H_4$, $H_5$, and $H_6$ as in Algorithm 5.2.1. However, these
transformations are not applied to A(:, 7:12) until their block representation
$H_4 H_5 H_6 = I + W_2 Y_2^T$ is found. This illustrates the general pattern.


    λ = 1; k = 0
    while λ <= n
        τ = min(λ + r - 1, n); k = k + 1
        Using Algorithm 5.2.1, upper triangularize A(λ:m, λ:τ)
        generating Householder matrices H_λ, ..., H_τ.            (5.2.1)
        Use Algorithm 5.1.2 to get the block representation
        I + W_k Y_k^T = H_λ ··· H_τ.
        A(λ:m, τ+1:n) = (I + W_k Y_k^T)^T A(λ:m, τ+1:n)
        λ = τ + 1
    end


The zero-nonzero structure of the Householder vectors that define the
matrices $H_\lambda, \ldots, H_\tau$ implies that the first $\lambda - 1$ rows of $W_k$ and $Y_k$ are zero.
This fact would be exploited in a practical implementation.

The proper way to regard (5.2.1) is through the partitioning

$$ A = [\, A_1, \ldots, A_N \,], \qquad N = \mathrm{ceil}(n/r) $$

where block column $A_k$ is processed during the kth step. In the kth step of
(5.2.1), a block Householder is formed that zeros the subdiagonal portion
of $A_k$. The remaining block columns are then updated.

The roundoff properties of (5.2.1) are essentially the same as those for
Algorithm 5.2.1. There is a slight increase in the number of flops required
because of the W-matrix computations. However, as a result of the
blocking, all but a small fraction of the flops occur in the context of matrix
multiplication. In particular, the level-3 fraction of (5.2.1) is approximately
1 - 2/N. See Bischof and Van Loan (1987) for further details.


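5.2.3 Givens QR Methods


Givens rotations can also be used to compute the QR factorization. The
4-by-3 case illustrates the general idea:

$$ \begin{bmatrix} \times & \times & \times \\ \times & \times & \times \\ \boxtimes & \times & \times \\ \boxtimes & \times & \times \end{bmatrix} \xrightarrow{(3,4)} \begin{bmatrix} \times & \times & \times \\ \boxtimes & \times & \times \\ \boxtimes & \times & \times \\ 0 & \times & \times \end{bmatrix} \xrightarrow{(2,3)} \begin{bmatrix} \boxtimes & \times & \times \\ \boxtimes & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \end{bmatrix} \xrightarrow{(1,2)} $$

$$ \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & \boxtimes & \times \\ 0 & \boxtimes & \times \end{bmatrix} \xrightarrow{(3,4)} \begin{bmatrix} \times & \times & \times \\ 0 & \boxtimes & \times \\ 0 & \boxtimes & \times \\ 0 & 0 & \times \end{bmatrix} \xrightarrow{(2,3)} \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \boxtimes \\ 0 & 0 & \boxtimes \end{bmatrix} \xrightarrow{(3,4)} R $$

Here we have highlighted the 2-vectors that define the underlying Givens
rotations. Clearly, if $G_j$ denotes the jth Givens rotation in the reduction,
then $Q^T A = R$ is upper triangular where $Q = G_1 \cdots G_t$ and t is the total
number of rotations. For general m and n we have:


Algorithm 5.2.2 (Givens QR) Given $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, the
following algorithm overwrites A with $Q^T A = R$, where R is upper triangular
and Q is orthogonal.


    for j = 1:n
        for i = m:-1:j+1
            [c, s] = givens(A(i-1, j), A(i, j))
            A(i-1:i, j:n) = [ c s ; -s c ]^T A(i-1:i, j:n)
        end
    end


This algorithm requires $3n^2(m - n/3)$ flops. Note that we could use (5.1.9)
to encode (c, s) in a single number $\rho$ which could then be stored in the zeroed
entry A(i, j). An operation such as $x \leftarrow Q^T x$ could then be implemented
by using (5.1.10), taking care to reconstruct the rotations in the proper
order.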
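A sketch of Algorithm 5.2.2 in plain Python is given below. It works on a copy of A stored as a list of rows and does not store the rotations, which are simplifications made for this sketch; `givens` repeats Algorithm 5.1.3 so the block is self-contained.

```python
import math

def givens(a, b):
    """Return (c, s) with [[c, s], [-s, c]]^T [a, b]^T = [r, 0]^T (Algorithm 5.1.3)."""
    if b == 0.0:
        return 1.0, 0.0
    if abs(b) > abs(a):
        tau = -a / b
        s = 1.0 / math.sqrt(1.0 + tau * tau)
        return s * tau, s
    tau = -b / a
    c = 1.0 / math.sqrt(1.0 + tau * tau)
    return c, c * tau

def givens_qr(A):
    """Return R = Q^T A, zeroing each column from the bottom up with adjacent-row rotations."""
    R = [[float(t) for t in row] for row in A]
    m, n = len(R), len(R[0])
    for j in range(n):
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            for col in range(j, n):
                t1, t2 = R[i - 1][col], R[i][col]
                R[i - 1][col] = c * t1 - s * t2
                R[i][col] = s * t1 + c * t2
    return R

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
R = givens_qr(A)
gram_R = [[sum(R[k][i] * R[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
gram_A = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
print(R)
```

The entries below the diagonal of R vanish to roundoff, and $R^T R$ matches $A^T A$ because every rotation is orthogonal.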

Other sequences of rotations can be used to upper triangularize A. For
example, if we replace the for statements in Algorithm 5.2.2 with


    for i = m:-1:2
        for j = 1:min{i - 1, n}


then the zeros in A are introduced row-by-row.

Another parameter in a Givens QR procedure concerns the planes of
rotation that are involved in the zeroing of each $a_{ij}$. For example, instead
of rotating rows i - 1 and i to zero $a_{ij}$ as in Algorithm 5.2.2, we could use
rows j and i:


    for j = 1:n
        for i = m:-1:j+1
            [c, s] = givens(A(j, j), A(i, j))
            A([j i], j:n) = [ c s ; -s c ]^T A([j i], j:n)
        end
    end

5.2.4 Hessenberg QR via Givens 


As an example of how Givens rotations can be used in structured problems,
we show how they can be employed to compute the QR factorization of an
upper Hessenberg matrix. A small example illustrates the general idea.
Suppose $n = 6$ and that after two steps we have computed

\[
G(2,3,\theta_2)^T G(1,2,\theta_1)^T A =
\begin{bmatrix}
x&x&x&x&x&x\\
0&x&x&x&x&x\\
0&0&x&x&x&x\\
0&0&x&x&x&x\\
0&0&0&x&x&x\\
0&0&0&0&x&x
\end{bmatrix}.
\]

We then compute $G(3,4,\theta_3)$ to zero the current (4,3) entry, thereby obtaining

\[
G(3,4,\theta_3)^T G(2,3,\theta_2)^T G(1,2,\theta_1)^T A =
\begin{bmatrix}
x&x&x&x&x&x\\
0&x&x&x&x&x\\
0&0&x&x&x&x\\
0&0&0&x&x&x\\
0&0&0&x&x&x\\
0&0&0&0&x&x
\end{bmatrix}.
\]


Overall we have 


Algorithm 5.2.3 (Hessenberg QR) If $A \in \mathbb{R}^{n\times n}$ is upper Hessenberg,
then the following algorithm overwrites $A$ with $Q^T A = R$ where $Q$ is
orthogonal and $R$ is upper triangular. $Q = G_1 \cdots G_{n-1}$ is a product of Givens
rotations where $G_j$ has the form $G_j = G(j, j+1, \theta_j)$.


    for j = 1:n-1
        [c, s] = givens(A(j, j), A(j+1, j))
        A(j:j+1, j:n) = [c s; -s c]^T A(j:j+1, j:n)
    end

This algorithm requires about $3n^2$ flops.
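
A small Python sketch of Algorithm 5.2.3 follows. It assumes the matrix is held as a list of rows and computes each rotation directly with `math.hypot`; the function name `hessenberg_qr` and the sample matrix `H` are illustrative choices, not part of the text.

```python
import math

def hessenberg_qr(H):
    """Return (R, rotations) with R = Q^T H for an upper Hessenberg H (list of rows)."""
    R = [row[:] for row in H]
    n = len(R)
    rotations = []
    for j in range(n - 1):
        a, b = R[j][j], R[j + 1][j]
        r = math.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0.0 else (a / r, -b / r)
        for k in range(j, n):               # rows j and j+1, columns j:n
            top, bot = R[j][k], R[j + 1][k]
            R[j][k] = c * top - s * bot
            R[j + 1][k] = s * top + c * bot
        rotations.append((c, s))            # G_j = G(j, j+1, theta_j)
    return R, rotations

H = [[4.0, 1.0, 2.0, 3.0],
     [3.0, 4.0, 1.0, 2.0],
     [0.0, 2.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
R, rots = hessenberg_qr(H)
```

Only $n-1$ rotations are needed because each column has a single subdiagonal entry, which is where the $O(n^2)$ flop count comes from.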


5.2.5 Fast Givens QR


We can use the fast Givens transformations described in §5.1.13 to compute
an $(M, D)$ representation of $Q$. In particular, if $M$ is nonsingular and $D$
is diagonal such that $M^T A = T$ is upper triangular and $M^T M = D$ is
diagonal, then $Q = M D^{-1/2}$ is orthogonal and $Q^T A = D^{-1/2} T \equiv R$ is
upper triangular. Analogous to the Givens QR procedure we have:


Algorithm 5.2.4 (Fast Givens QR) Given $A \in \mathbb{R}^{m\times n}$ with $m \ge n$, the
following algorithm computes nonsingular $M \in \mathbb{R}^{m\times m}$ and positive $d(1:m)$
such that $M^T A = T$ is upper triangular, and $M^T M = \mathrm{diag}(d_1, \ldots, d_m)$. $A$
is overwritten by $T$. Note: $A = (M D^{-1/2})(D^{-1/2} T)$ is a QR factorization
of $A$.

    for i = 1:m
        d(i) = 1
    end
    for j = 1:n
        for i = m:-1:j+1
            [alpha, beta, type] = fast.givens(A(i-1:i, j), d(i-1:i))
            if type = 1
                A(i-1:i, j:n) = [beta 1; 1 alpha]^T A(i-1:i, j:n)
            else
                A(i-1:i, j:n) = [1 alpha; beta 1]^T A(i-1:i, j:n)
            end
        end
    end

This algorithm requires $2n^2(m - n/3)$ flops. As we mentioned in the
previous section, it is necessary to guard against overflow in fast Givens
algorithms such as the above. This means that $M$, $D$, and $A$ must be
periodically scaled if their entries become large.

If the QR factorization of a narrow band matrix is required, then the
fast Givens approach is attractive because it involves no square roots. (We
found $LDL^T$ preferable to Cholesky in the narrow band case for the same
reason; see §4.3.5.) In particular, if $A \in \mathbb{R}^{m\times n}$ has upper bandwidth $q$ and
lower bandwidth $p$, then $Q^T A = R$ has upper bandwidth $p + q$. In this
case Givens QR requires about $O(np(p + q))$ flops and $O(np)$ square roots.
Thus, the square roots are a significant portion of the overall computation
if $p, q \ll n$.


5.2.6 Properties of the QR Factorization 


The above algorithms "prove" that the QR factorization exists. Now we
relate the columns of $Q$ to $\mathrm{ran}(A)$ and $\mathrm{ran}(A)^{\perp}$ and examine the uniqueness
question.


Theorem 5.2.1 If $A = QR$ is a QR factorization of a full column rank
$A \in \mathbb{R}^{m\times n}$ and $A = [a_1, \ldots, a_n]$ and $Q = [q_1, \ldots, q_m]$ are column
partitionings, then

\[
\mathrm{span}\{a_1, \ldots, a_k\} = \mathrm{span}\{q_1, \ldots, q_k\}, \qquad k = 1:n.
\]

In particular, if $Q_1 = Q(1:m, 1:n)$ and $Q_2 = Q(1:m, n+1:m)$ then

\[
\mathrm{ran}(A) = \mathrm{ran}(Q_1), \qquad \mathrm{ran}(A)^{\perp} = \mathrm{ran}(Q_2)
\]

and $A = Q_1 R_1$ with $R_1 = R(1:n, 1:n)$.

Proof. Comparing $k$th columns in $A = QR$ we conclude that

\[
a_k = \sum_{i=1}^{k} r_{ik} q_i \in \mathrm{span}\{q_1, \ldots, q_k\}. \tag{5.2.2}
\]

Thus, $\mathrm{span}\{a_1, \ldots, a_k\} \subseteq \mathrm{span}\{q_1, \ldots, q_k\}$. However, since $\mathrm{rank}(A) = n$
it follows that $\mathrm{span}\{a_1, \ldots, a_k\}$ has dimension $k$ and so must equal
$\mathrm{span}\{q_1, \ldots, q_k\}$. The rest of the theorem follows trivially. $\Box$


The matrices $Q_1 = Q(1:m, 1:n)$ and $Q_2 = Q(1:m, n+1:m)$ can be easily
computed from a factored form representation of $Q$.

If $A = QR$ is a QR factorization of $A \in \mathbb{R}^{m\times n}$ and $m \ge n$, then we refer
to $A = Q(:, 1:n) R(1:n, 1:n)$ as the thin QR factorization. The next result
addresses the uniqueness issue for the thin QR factorization.


Theorem 5.2.2 Suppose $A \in \mathbb{R}^{m\times n}$ has full column rank. The thin QR
factorization

\[
A = Q_1 R_1
\]

is unique where $Q_1 \in \mathbb{R}^{m\times n}$ has orthonormal columns and $R_1$ is upper
triangular with positive diagonal entries. Moreover, $R_1 = G^T$ where $G$ is the
lower triangular Cholesky factor of $A^T A$.


Proof. Since $A^T A = (Q_1 R_1)^T (Q_1 R_1) = R_1^T R_1$ we see that $G = R_1^T$ is the
Cholesky factor of $A^T A$. This factor is unique by Theorem 4.2.5. Since
$Q_1 = A R_1^{-1}$ it follows that $Q_1$ is also unique. $\Box$


How are $Q_1$ and $R_1$ affected by perturbations in $A$? To answer this
question we need to extend the notion of condition to rectangular matrices.
Recall from §2.7.3 that the 2-norm condition of a square nonsingular matrix
is the ratio of the largest and smallest singular values. For rectangular
matrices with full column rank we continue with this definition:

\[
A \in \mathbb{R}^{m\times n},\ \mathrm{rank}(A) = n \quad\Longrightarrow\quad
\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}.
\]

If the columns of $A$ are nearly dependent, then $\kappa_2(A)$ is large. Stewart
(1993) has shown that $O(\epsilon)$ relative error in $A$ induces $O(\epsilon\,\kappa_2(A))$ relative
error in $R$ and $Q_1$.


5.2.7 Classical Gram-Schmidt 


We now discuss two alternative methods that can be used to compute the
thin QR factorization $A = Q_1 R_1$ directly. If $\mathrm{rank}(A) = n$, then equation
(5.2.2) can be solved for $q_k$:

\[
q_k = \Bigl( a_k - \sum_{i=1}^{k-1} r_{ik} q_i \Bigr) \Big/ r_{kk}.
\]

Thus, we can think of $q_k$ as a unit 2-norm vector in the direction of

\[
z_k = a_k - \sum_{i=1}^{k-1} r_{ik} q_i
\]

where to ensure $z_k \in \mathrm{span}\{q_1, \ldots, q_{k-1}\}^{\perp}$ we choose

\[
r_{ik} = q_i^T a_k, \qquad i = 1:k-1.
\]

This leads to the classical Gram-Schmidt (CGS) algorithm for computing
$A = Q_1 R_1$.

    R(1,1) = || A(:,1) ||_2
    Q(:,1) = A(:,1)/R(1,1)
    for k = 2:n
        R(1:k-1, k) = Q(1:m, 1:k-1)^T A(1:m, k)
        z = A(1:m, k) - Q(1:m, 1:k-1) R(1:k-1, k)              (5.2.3)
        R(k,k) = || z ||_2
        Q(1:m, k) = z/R(k,k)
    end


In the $k$th step of CGS, the $k$th columns of both $Q$ and $R$ are generated.


5.2.8 Modified Gram-Schmidt 


Unfortunately, the CGS method has very poor numerical properties in that
there is typically a severe loss of orthogonality among the computed $q_i$.
Interestingly, a rearrangement of the calculation, known as modified
Gram-Schmidt (MGS), yields a much sounder computational procedure. In the
$k$th step of MGS, the $k$th column of $Q$ (denoted by $q_k$) and the $k$th row of
$R$ (denoted by $r_k^T$) are determined. To derive the MGS method, define the
matrix $A^{(k)} \in \mathbb{R}^{m\times(n-k+1)}$ by

\[
A - \sum_{i=1}^{k-1} q_i r_i^T = \sum_{i=k}^{n} q_i r_i^T = [\,0 \;\; A^{(k)}\,]. \tag{5.2.4}
\]

It follows that if

\[
A^{(k)} = [\,\underset{1}{z} \;\; \underset{n-k}{B}\,]
\]

then $r_{kk} = \|z\|_2$, $q_k = z/r_{kk}$ and $(r_{k,k+1} \cdots r_{kn}) = q_k^T B$. We then
compute the outer product $A^{(k+1)} = B - q_k (r_{k,k+1} \cdots r_{kn})$ and proceed
to the next step. This completely describes the $k$th step of MGS.


Algorithm 5.2.5 (Modified Gram-Schmidt) Given $A \in \mathbb{R}^{m\times n}$ with
$\mathrm{rank}(A) = n$, the following algorithm computes the factorization $A = Q_1 R_1$
where $Q_1 \in \mathbb{R}^{m\times n}$ has orthonormal columns and $R_1 \in \mathbb{R}^{n\times n}$ is upper
triangular.

    for k = 1:n
        R(k,k) = || A(1:m, k) ||_2
        Q(1:m, k) = A(1:m, k)/R(k,k)
        for j = k+1:n
            R(k,j) = Q(1:m, k)^T A(1:m, j)
            A(1:m, j) = A(1:m, j) - Q(1:m, k) R(k,j)
        end
    end


This algorithm requires $2mn^2$ flops. It is not possible to overwrite $A$ with
both $Q_1$ and $R_1$. Typically, the MGS computation is arranged so that $A$ is
overwritten by $Q_1$ and the matrix $R_1$ is stored in a separate array.


5.2.9 Work and Accuracy


If one is interested in computing an orthonormal basis for $\mathrm{ran}(A)$, then
the Householder approach requires $2mn^2 - 2n^3/3$ flops to get $Q$ in factored
form and another $2mn^2 - 2n^3/3$ flops to get the first $n$ columns of
$Q$. (This requires "paying attention" to just the first $n$ columns of $Q$ in
(5.1.5).) Therefore, for the problem of finding an orthonormal basis for
$\mathrm{ran}(A)$, MGS is about twice as efficient as Householder orthogonalization.
However, Björck (1967) has shown that MGS produces a computed
$\hat{Q}_1 = [\hat{q}_1, \ldots, \hat{q}_n]$ that satisfies

\[
\hat{Q}_1^T \hat{Q}_1 = I + E_{MGS}, \qquad \| E_{MGS} \|_2 \approx u\,\kappa_2(A),
\]

whereas the corresponding result for the Householder approach is of the
form

\[
\hat{Q}_1^T \hat{Q}_1 = I + E_{H}, \qquad \| E_{H} \|_2 \approx u.
\]

Thus, if orthonormality is critical, then MGS should be used to compute
orthonormal bases only when the vectors to be orthogonalized are fairly
independent.

We also mention that the computed triangular factor $\hat{R}$ produced by
MGS satisfies $\| A - \hat{Q}\hat{R} \| \approx u \| A \|$ and that there exists a $Q$ with perfectly
orthonormal columns such that $\| A - Q\hat{R} \| \approx u \| A \|$. See Higham (1996,
p.379).
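
The contrast between CGS and MGS can be observed in ordinary double precision. The sketch below is an illustration under stated assumptions: the test matrix (three nearly parallel columns with a small parameter `eps`) is chosen for this example and does not come from the text, and the helper names `cgs`, `mgs`, and `orthogonality_loss` are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cgs(cols):
    """Classical Gram-Schmidt on a list of columns; returns the list of q_k."""
    Q = []
    for a in cols:
        coeffs = [dot(q, a) for q in Q]     # r_ik = q_i^T a_k, all taken from the original a_k
        z = list(a)
        for r, q in zip(coeffs, Q):
            z = [zi - r * qi for zi, qi in zip(z, q)]
        nrm = math.sqrt(dot(z, z))
        Q.append([zi / nrm for zi in z])
    return Q

def mgs(cols):
    """Modified Gram-Schmidt: q_k is removed from the remaining columns at once."""
    work = [list(a) for a in cols]
    Q = []
    for k in range(len(work)):
        nrm = math.sqrt(dot(work[k], work[k]))
        q = [x / nrm for x in work[k]]
        Q.append(q)
        for j in range(k + 1, len(work)):
            r = dot(q, work[j])
            work[j] = [x - r * qi for x, qi in zip(work[j], q)]
    return Q

def orthogonality_loss(Q):
    """Largest |q_i^T q_j| over i != j."""
    return max(abs(dot(Q[i], Q[j])) for i in range(len(Q)) for j in range(i))

eps = 1e-8                                  # columns are nearly dependent
cols = [[1.0, eps, 0.0, 0.0], [1.0, 0.0, eps, 0.0], [1.0, 0.0, 0.0, eps]]
loss_cgs = orthogonality_loss(cgs(cols))
loss_mgs = orthogonality_loss(mgs(cols))
```

On this matrix the CGS vectors lose orthogonality completely (two of them have an inner product near 1/2), while the MGS loss stays at the level of $u\,\kappa_2(A)$, in line with the Björck result quoted above.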


Example 5.2.1 If modified Gram-Schmidt is applied to

\[
A = \begin{bmatrix} 1 & 1\\ 10^{-3} & 0\\ 0 & 10^{-3} \end{bmatrix}, \qquad
\kappa_2(A) \approx 1.4 \cdot 10^{3},
\]

with 6-digit decimal arithmetic, then

\[
[\,\hat{q}_1 \;\; \hat{q}_2\,] = \begin{bmatrix} 1.00000 & 0\\ .001 & -.707107\\ 0 & .707100 \end{bmatrix}.
\]


5.2.10 A Note on Complex QR


Most of the algorithms that we present in this book have complex
versions that are fairly straightforward to derive from their real counterparts.
(This is not to say that everything is easy and obvious at the
implementation level.) As an illustration we outline what a complex Householder QR
factorization algorithm looks like.

Starting at the level of an individual Householder transformation,
suppose $0 \ne x \in \mathbb{C}^n$ and that $x_1 = re^{i\theta}$ where $r, \theta \in \mathbb{R}$. If $v = x \pm e^{i\theta}\|x\|_2 e_1$
and $P = I_n - \beta vv^H$, $\beta = 2/v^H v$, then $Px = \mp e^{i\theta}\|x\|_2 e_1$. (See P5.1.3.)
The sign can be determined to maximize $\|v\|_2$ for the sake of stability.

The upper triangularization of $A \in \mathbb{C}^{m\times n}$, $m \ge n$, proceeds as in
Algorithm 5.2.1. In step $j$ we zero the subdiagonal portion of $A(j:m, j)$:


    for j = 1:n
        x = A(j:m, j)
        v = x ± e^{iθ} || x ||_2 e_1  where x_1 = r e^{iθ}
        β = 2/v^H v
        A(j:m, j:n) = (I_{m-j+1} - β v v^H) A(j:m, j:n)
    end


The reduction involves $8n^2(m - n/3)$ real flops, four times the number
required to execute Algorithm 5.2.1. If $Q = P_1 \cdots P_n$ is the product of the
Householder transformations, then $Q$ is unitary and $Q^H A = R \in \mathbb{C}^{m\times n}$ is
complex and upper triangular.


Problems


P5.2.1 Adapt the Householder QR algorithm so that it can efficiently handle the case
when $A \in \mathbb{R}^{m\times n}$ has lower bandwidth $p$ and upper bandwidth $q$.


P5.2.2 Adapt the Householder QR algorithm so that it computes the factorization
$A = QL$ where $L$ is lower triangular and $Q$ is orthogonal. Assume that $A$ is square. This
involves rewriting the Householder vector function $v = \mathrm{house}(x)$ so that $(I - 2vv^T/v^T v)x$
is zero everywhere but its bottom component.


P5.2.3 Adapt the Givens QR factorization algorithm so that the zeros are introduced by 
diagonal. That is, the entries are zeroed in the order (m, 1), (m — 1, 1), (m, 2), (m— 2, 1), 
(m — 1,2), (m, 3) , etc. 

P5.2.4 Adapt the fast Givens QR factorization algorithm so that it efficiently handles
the case when $A$ is n-by-n and tridiagonal. Assume that the subdiagonal, diagonal, and
superdiagonal of $A$ are stored in $e(1:n-1)$, $a(1:n)$, $f(1:n-1)$ respectively. Design your
algorithm so that these vectors are overwritten by the nonzero portion of $T$.


P5.2.5 Suppose $L \in \mathbb{R}^{m\times n}$ with $m \ge n$ is lower triangular. Show how Householder
matrices $H_1, \ldots, H_n$ can be used to determine a lower triangular $L_1 \in \mathbb{R}^{n\times n}$ so that

\[
H_n \cdots H_1 L = \begin{bmatrix} L_1 \\ 0 \end{bmatrix}.
\]

Hint: The second step in the 6-by-3 case involves finding $H_2$ so that

\[
H_2 \begin{bmatrix} x&0&0\\ x&x&0\\ x&x&x\\ x&x&0\\ x&x&0\\ x&x&0 \end{bmatrix}
= \begin{bmatrix} x&0&0\\ x&x&0\\ x&x&x\\ x&0&0\\ x&0&0\\ x&0&0 \end{bmatrix}
\]

with the property that rows 1 and 3 are left alone.
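P5.2.6 Show that if

\[
A = \begin{bmatrix} R & w \\ 0 & v \end{bmatrix}
\begin{array}{l} k \\ m-k \end{array},
\qquad
b = \begin{bmatrix} c \\ d \end{bmatrix}
\begin{array}{l} k \\ m-k \end{array},
\]

where the column blocks of $A$ have widths $k$ and $n-k$, and $A$ has full column rank, then
$\min \| Ax - b \|_2^2 = \| d \|_2^2 - (v^T d / \| v \|_2)^2$.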


P5.2.7 Suppose $A \in \mathbb{R}^{n\times n}$ and $D = \mathrm{diag}(d_1, \ldots, d_n) \in \mathbb{R}^{n\times n}$. Show how to construct
an orthogonal $Q$ such that $Q^T A - D Q^T = R$ is upper triangular. Do not worry about
efficiency; this is just an exercise in QR manipulation.

P5.2.8 Show how to compute the QR factorization of the product $A = A_p \cdots A_2 A_1$
without explicitly multiplying the matrices $A_1, \ldots, A_p$ together. Hint: In the $p = 3$
case, write $Q_3^T A = Q_3^T A_3 Q_2 Q_2^T A_2 Q_1 Q_1^T A_1$ and determine orthogonal $Q_i$ so that
$Q_i^T (A_i Q_{i-1})$ is upper triangular. ($Q_0 = I$).

P5.2.9 Suppose $A \in \mathbb{R}^{n\times n}$ and let $E$ be the permutation obtained by reversing the
order of the rows in $I_n$. (This is just the exchange matrix of §4.7.) (a) Show that if
$R \in \mathbb{R}^{n\times n}$ is upper triangular, then $L = ERE$ is lower triangular. (b) Show how to
compute an orthogonal $Q \in \mathbb{R}^{n\times n}$ and a lower triangular $L \in \mathbb{R}^{n\times n}$ so that $A = QL$
assuming the availability of a procedure for computing the QR factorization.

P5.2.10 MGS applied to $A \in \mathbb{R}^{m\times n}$ is numerically equivalent to the first step in
Householder QR applied to

\[
\tilde{A} = \begin{bmatrix} O_n \\ A \end{bmatrix}
\]

where $O_n$ is the n-by-n zero matrix. Verify that this statement is true after the first
step of each method is completed.

P5.2.11 Reverse the loop orders in Algorithm 5.2.5 (MGS QR) so that $R$ is computed
column-by-column.

P5.2.12 Develop a complex version of the Givens QR factorization. Refer to P5.1.5,
where complex Givens rotations are the theme. Is it possible to organize the calculations
so that the diagonal elements of $R$ are nonnegative?


Notes and References for Sec. 5.2

The idea of using Householder transformations to solve the LS problem was proposed in

A.S. Householder (1958). "Unitary Triangularization of a Nonsymmetric Matrix," J.
ACM 5, 339-42.

The practical details were worked out in

P. Businger and G.H. Golub (1965). "Linear Least Squares Solutions by Householder
Transformations," Numer. Math. 7, 269-76. See also Wilkinson and Reinsch
(1971, 111-18).

Numer. Math. 7, 206-16.

365-97.

T.F. Coleman and D.C. Sorensen (1984). "A Note on the Computation of an Orthonormal
Basis for the Null Space of a Matrix," Mathematical Programming 29, 234-242.

J.R. Rice (1966). "Experiments on Gram-Schmidt Orthogonalization," Math. Comp.
20, 325-28.

A. Björck (1967). "Solving Linear Least Squares Problems by Gram-Schmidt
Orthogonalization," BIT 7, 1-21.

N.N. Abdelmalek (1971). "Roundoff Error Analysis for Gram-Schmidt Method and
Solution of Linear Least Squares Problems," BIT 11, 345-68.

J. Daniel, W.B. Gragg, L. Kaufman, and G.W. Stewart (1976). "Reorthogonalization
and Stable Algorithms for Updating the Gram-Schmidt QR Factorization," Math.
Comp. 30, 772-795.

A. Ruhe (1983). "Numerical Aspects of Gram-Schmidt Orthogonalization of Vectors,"
Lin. Alg. and Its Applic. 52/53, 591-601.

W. Jalby and B. Philippe (1991). "Stability Analysis and Improvement of the Block

C.J. Demeure (1989). "Fast QR Factorization of Vandermonde Matrices," Lin. Alg.

Various high-performance issues pertaining to the QR factorization are discussed in
5.3 The Full Rank LS Problem


Consider the problem of finding a vector $x \in \mathbb{R}^n$ such that $Ax = b$ where
the data matrix $A \in \mathbb{R}^{m\times n}$ and the observation vector $b \in \mathbb{R}^m$ are given and
$m \ge n$. When there are more equations than unknowns, we say that the
system $Ax = b$ is overdetermined. Usually an overdetermined system has
no exact solution since $b$ must be an element of $\mathrm{ran}(A)$, a proper subspace
of $\mathbb{R}^m$.

This suggests that we strive to minimize $\| Ax - b \|_p$ for some suitable
choice of $p$. Different norms render different optimum solutions. For example,
if $A = [1, 1, 1]^T$ and $b = [b_1, b_2, b_3]^T$ with $b_1 \ge b_2 \ge b_3 \ge 0$, then it
can be verified that

\[
\begin{array}{lcl}
p = 1 & \Rightarrow & x_{opt} = b_2 \\
p = 2 & \Rightarrow & x_{opt} = (b_1 + b_2 + b_3)/3 \\
p = \infty & \Rightarrow & x_{opt} = (b_1 + b_3)/2.
\end{array}
\]

Minimization in the 1-norm and $\infty$-norm is complicated by the fact that
the function $f(x) = \| Ax - b \|_p$ is not differentiable for these values of
$p$. However, much progress has been made in this area, and there are
several good techniques available for 1-norm and $\infty$-norm minimization.
See Coleman and Li (1992), Li (1993), and Zhang (1993).
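
The three minimizers quoted above (median, mean, and midrange of the entries of $b$) can be checked numerically. The following sketch does so by brute-force search on a grid; the data vector `b`, the grid, and the names `residual_norm`, `claimed`, and `found` are assumptions made for this illustration.

```python
def residual_norm(x, b, p):
    """p-norm of A*x - b for A = [1, 1, 1]^T, i.e. of the vector (x - b_i)."""
    r = [abs(x - bi) for bi in b]
    if p == float("inf"):
        return max(r)
    return sum(ri ** p for ri in r) ** (1.0 / p)

b = [5.0, 2.0, 1.0]                         # sample data with b1 >= b2 >= b3 >= 0
claimed = {1: b[1], 2: sum(b) / 3.0, float("inf"): (b[0] + b[2]) / 2.0}

grid = [k / 1000.0 for k in range(0, 6001)] # candidate x values on [0, 6]
found = {p: min(grid, key=lambda x, p=p: residual_norm(x, b, p)) for p in claimed}
```

Each entry of `found` agrees with the corresponding entry of `claimed` to within the grid spacing, which illustrates how strongly the choice of norm affects the optimum.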

In contrast to general p-norm minimization, the least squares (LS) problem

\[
\min_{x \in \mathbb{R}^n} \| Ax - b \|_2 \tag{5.3.1}
\]

is more tractable for two reasons:


* $\phi(x) = \frac{1}{2}\| Ax - b \|_2^2$ is a differentiable function of $x$ and so the
  minimizers of $\phi$ satisfy the gradient equation $\nabla\phi(x) = 0$. This turns out
  to be an easily constructed symmetric linear system which is positive
  definite if $A$ has full column rank.

* The 2-norm is preserved under orthogonal transformation. This means
  that we can seek an orthogonal $Q$ such that the equivalent problem
  of minimizing $\| (Q^T A)x - (Q^T b) \|_2$ is "easy" to solve.


In this section we pursue these two solution approaches for the case when
$A$ has full column rank. Methods based on normal equations and the QR
factorization are detailed and compared.


5.3.1 Implications of Full Rank

Suppose $x \in \mathbb{R}^n$, $z \in \mathbb{R}^n$, and $\alpha \in \mathbb{R}$ and consider the equality

\[
\| A(x + \alpha z) - b \|_2^2 = \| Ax - b \|_2^2 + 2\alpha z^T A^T(Ax - b) + \alpha^2 \| Az \|_2^2
\]

where $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$. If $x$ solves the LS problem (5.3.1) then
we must have $A^T(Ax - b) = 0$. Otherwise, if $z = -A^T(Ax - b)$ and
we make $\alpha$ small enough, then we obtain the contradictory inequality
$\| A(x + \alpha z) - b \|_2 < \| Ax - b \|_2$. We may also conclude that if $x$ and
$x + \alpha z$ are LS minimizers, then $z \in \mathrm{null}(A)$.

Thus, if $A$ has full column rank, then there is a unique LS solution $x_{LS}$
and it solves the symmetric positive definite linear system

\[
A^T A x_{LS} = A^T b.
\]

These are called the normal equations. Since $\nabla\phi(x) = A^T(Ax - b)$ where
$\phi(x) = \frac{1}{2}\| Ax - b \|_2^2$, we see that solving the normal equations is
tantamount to solving the gradient equation $\nabla\phi = 0$. We call

\[
r_{LS} = b - A x_{LS}
\]

the minimum residual and we use the notation

\[
\rho_{LS} = \| A x_{LS} - b \|_2
\]

to denote its size. Note that if $\rho_{LS}$ is small, then we can "predict" $b$ with
the columns of $A$.

So far we have been assuming that $A \in \mathbb{R}^{m\times n}$ has full column rank.
This assumption is dropped in §5.5. However, even if $\mathrm{rank}(A) = n$, then
we can expect trouble in the above procedures if $A$ is nearly rank deficient.

When assessing the quality of a computed LS solution $\hat{x}_{LS}$, there are
two important issues to bear in mind:


* How close is $\hat{x}_{LS}$ to $x_{LS}$?

* How small is $\hat{r}_{LS} = b - A\hat{x}_{LS}$ compared to $r_{LS} = b - A x_{LS}$?

The relative importance of these two criteria varies from application to
application. In any case it is important to understand how $x_{LS}$ and $r_{LS}$
are affected by perturbations in $A$ and $b$. Our intuition tells us that if
the columns of $A$ are nearly dependent, then these quantities may be quite
sensitive.


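Example 5.3.1 Suppose

\[
A = \begin{bmatrix} 1 & 0\\ 0 & 10^{-6}\\ 0 & 0 \end{bmatrix},\quad
\delta A = \begin{bmatrix} 0 & 0\\ 0 & 0\\ 0 & 10^{-8} \end{bmatrix},\quad
b = \begin{bmatrix} 1\\ 0\\ 1 \end{bmatrix},\quad
\delta b = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix},
\]

and that $x_{LS}$ and $\hat{x}_{LS}$ minimize $\| Ax - b \|_2$ and $\| (A + \delta A)x - (b + \delta b) \|_2$ respectively.
Let $r_{LS}$ and $\hat{r}_{LS}$ be the corresponding minimum residuals. Then

\[
x_{LS} = \begin{bmatrix} 1\\ 0 \end{bmatrix},\quad
\hat{x}_{LS} = \begin{bmatrix} 1\\ .9999 \cdot 10^{4} \end{bmatrix},\quad
r_{LS} = \begin{bmatrix} 0\\ 0\\ 1 \end{bmatrix},\quad
\hat{r}_{LS} = \begin{bmatrix} 0\\ -.9999 \cdot 10^{-2}\\ .9999 \cdot 10^{0} \end{bmatrix}.
\]

Since $\kappa_2(A) = 10^{6}$ we have

\[
\frac{\| \hat{x}_{LS} - x_{LS} \|_2}{\| x_{LS} \|_2} \approx .9999 \cdot 10^{4}
\le \kappa_2(A)^2 \frac{\| \delta A \|_2}{\| A \|_2} = 10^{12} \cdot 10^{-8}
\]

and

\[
\frac{\| \hat{r}_{LS} - r_{LS} \|_2}{\| b \|_2} \approx .7070 \cdot 10^{-2}
\le \kappa_2(A) \frac{\| \delta A \|_2}{\| A \|_2} = 10^{6} \cdot 10^{-8}.
\]

The example suggests that the sensitivity of $x_{LS}$ depends upon $\kappa_2(A)^2$. At
the end of this section we develop a perturbation theory for the LS problem
and the $\kappa_2(A)^2$ factor will return.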
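
The numbers in Example 5.3.1 can be reproduced in double precision. The sketch below solves each two-column LS problem through its 2-by-2 normal equations with Cramer's rule; that shortcut is an assumption of this illustration and is harmless here only because the columns of $A$ and of $A + \delta A$ are orthogonal. The name `ls_two_columns` is hypothetical.

```python
import math

def ls_two_columns(A, b):
    """Solve min ||Ax - b||_2 for an m-by-2 matrix A; return (x, r) with r = b - Ax."""
    c11 = sum(row[0] * row[0] for row in A)
    c12 = sum(row[0] * row[1] for row in A)
    c22 = sum(row[1] * row[1] for row in A)
    d1 = sum(row[0] * bi for row, bi in zip(A, b))
    d2 = sum(row[1] * bi for row, bi in zip(A, b))
    det = c11 * c22 - c12 * c12
    x = [(d1 * c22 - c12 * d2) / det, (c11 * d2 - c12 * d1) / det]
    r = [bi - (row[0] * x[0] + row[1] * x[1]) for row, bi in zip(A, b)]
    return x, r

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

A = [[1.0, 0.0], [0.0, 1e-6], [0.0, 0.0]]
dA = [[0.0, 0.0], [0.0, 0.0], [0.0, 1e-8]]
b = [1.0, 0.0, 1.0]
A_pert = [[a + e for a, e in zip(ra, rd)] for ra, rd in zip(A, dA)]

x, r = ls_two_columns(A, b)
x_hat, r_hat = ls_two_columns(A_pert, b)
rel_x = norm2([p - q for p, q in zip(x_hat, x)]) / norm2(x)    # about .9999e4
rel_r = norm2([p - q for p, q in zip(r_hat, r)]) / norm2(b)    # about .707e-2
```

A perturbation of relative size $10^{-8}$ in $A$ moves the solution by roughly $10^{4}$ in the relative sense, while the residual moves by less than $10^{-2}$, consistent with the $\kappa_2(A)^2$ and $\kappa_2(A)$ factors in the example.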


5.3.2 The Method of Normal Equations 


The most widely used method for solving the full rank LS problem is the 
method of normal equations. 


Algorithm 5.3.1 (Normal Equations) Given $A \in \mathbb{R}^{m\times n}$ with the
property that $\mathrm{rank}(A) = n$ and $b \in \mathbb{R}^m$, this algorithm computes the solution
$x_{LS}$ to the LS problem $\min \| Ax - b \|_2$.


    Compute the lower triangular portion of C = A^T A.
    d = A^T b
    Compute the Cholesky factorization C = G G^T.
    Solve G y = d and G^T x_LS = y.


This algorithm requires $(m + n/3)n^2$ flops. The normal equation approach
is convenient because it relies on standard algorithms: Cholesky
factorization, matrix-matrix multiplication, and matrix-vector multiplication. The
compression of the m-by-n data matrix $A$ into the (typically) much smaller
n-by-n cross-product matrix $C$ is attractive.
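
The four steps of Algorithm 5.3.1 can be written out as follows. This is a minimal sketch with plain Python lists, without any check for loss of positive definiteness; the function name `normal_equations_ls` and the sample data (a straight-line fit through three points) are illustrative assumptions.

```python
import math

def normal_equations_ls(A, b):
    """Solve min ||Ax - b||_2 via C = A^T A = G G^T (A: list of rows, full column rank)."""
    m, n = len(A), len(A[0])
    # Lower triangular portion of C = A^T A, and d = A^T b.
    C = [[sum(A[k][i] * A[k][j] for k in range(m)) if j <= i else 0.0
          for j in range(n)] for i in range(n)]
    d = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Cholesky factorization C = G G^T, column by column.
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        G[j][j] = math.sqrt(C[j][j] - sum(G[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            G[i][j] = (C[i][j] - sum(G[i][k] * G[j][k] for k in range(j))) / G[j][j]
    # Forward substitution G y = d, then back substitution G^T x = y.
    y = [0.0] * n
    for i in range(n):
        y[i] = (d[i] - sum(G[i][k] * y[k] for k in range(i))) / G[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(G[k][i] * x[k] for k in range(i + 1, n))) / G[i][i]
    return x

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]    # fit c0 + c1*t at t = 0, 1, 2
b = [1.0, 2.0, 4.0]
x = normal_equations_ls(A, b)               # intercept 5/6, slope 3/2
```

Only the 2-by-2 matrix $C$ and the 2-vector $d$ are needed after the first two steps, which is the compression the text refers to.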
Let us consider the accuracy of the computed normal equations solution
$\hat{x}_{LS}$. For clarity, assume that no roundoff errors occur during the formation
of $C = A^T A$ and $d = A^T b$. (On many computers inner products are
accumulated in double precision and so this is not a terribly unfair assumption.)
It follows from what we know about the roundoff properties of the Cholesky
factorization (cf. §4.2.7) that

\[
(A^T A + E)\hat{x}_{LS} = A^T b,
\]

where $\| E \|_2 \approx u \| A^T \|_2 \| A \|_2 \approx u \| A^T A \|_2$ and thus we can expect

\[
\frac{\| \hat{x}_{LS} - x_{LS} \|_2}{\| x_{LS} \|_2} \approx u\,\kappa_2(A^T A) = u\,\kappa_2(A)^2. \tag{5.3.2}
\]

In other words, the accuracy of the computed normal equations solution
depends on the square of the condition. This seems to be consistent with
Example 5.3.1 but more refined comments follow in §5.3.9.


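Example 5.3.2 It should be noted that the formation of $A^T A$ can result in a severe
loss of information. If

\[
A = \begin{bmatrix} 1 & 1\\ 10^{-3} & 0\\ 0 & 10^{-3} \end{bmatrix}
\quad\text{and}\quad
b = \begin{bmatrix} 2\\ 10^{-3}\\ 10^{-3} \end{bmatrix},
\]

then $\kappa_2(A) \approx 1.4 \cdot 10^{3}$, $x_{LS} = [1\ 1]^T$, and $\rho_{LS} = 0$. If the normal equations method is
executed with base 10, $t = 6$ arithmetic, then a divide-by-zero occurs during the solution
process, since

\[
fl(A^T A) = \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix}
\]

is exactly singular. On the other hand, if 7-digit arithmetic is used, then
$\hat{x}_{LS} = [2.000001,\ 0]^T$ and $\| \hat{x}_{LS} - x_{LS} \|_2 / \| x_{LS} \|_2 \approx u\,\kappa_2(A)^2$.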

5.3.3 LS Solution Via QR Factorization 


Let $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ and $b \in \mathbb{R}^m$ be given and suppose that an
orthogonal matrix $Q \in \mathbb{R}^{m\times m}$ has been computed such that

\[
Q^T A = R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}
\begin{array}{l} n \\ m-n \end{array} \tag{5.3.3}
\]

is upper triangular. If

\[
Q^T b = \begin{bmatrix} c \\ d \end{bmatrix}
\begin{array}{l} n \\ m-n \end{array}
\]

then

\[
\| Ax - b \|_2^2 = \| Q^T A x - Q^T b \|_2^2 = \| R_1 x - c \|_2^2 + \| d \|_2^2
\]

for any $x \in \mathbb{R}^n$. Clearly, if $\mathrm{rank}(A) = \mathrm{rank}(R_1) = n$, then $x_{LS}$ is defined
by the upper triangular system $R_1 x_{LS} = c$. Note that

\[
\rho_{LS} = \| d \|_2.
\]

We conclude that the full rank LS problem can be readily solved once we
have computed the QR factorization of $A$. Details depend on the exact QR
procedure. If Householder matrices are used and $Q^T$ is applied in factored
form to $b$, then we obtain


Algorithm 5.3.2 (Householder LS Solution) If $A \in \mathbb{R}^{m\times n}$ has full
column rank and $b \in \mathbb{R}^m$, then the following algorithm computes a vector
$x_{LS} \in \mathbb{R}^n$ such that $\| A x_{LS} - b \|_2$ is minimum.


    Use Algorithm 5.2.1 to overwrite A with its QR factorization.
    for j = 1:n
        v(j) = 1; v(j+1:m) = A(j+1:m, j)
        b(j:m) = (I_{m-j+1} - beta_j v(j:m) v(j:m)^T) b(j:m)
    end
    Solve R(1:n, 1:n) x_LS = b(1:n) using back substitution.


This method for solving the full rank LS problem requires $2n^2(m - n/3)$
flops. The $O(mn)$ flops associated with the updating of $b$ and the $O(n^2)$
flops associated with the back substitution are not significant compared to
the work required to factor $A$.
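
The following sketch illustrates the Householder approach on a small problem. It departs from Algorithm 5.3.2 in two labeled ways: the Householder vectors are not normalized to have unit first component and are not stored in $A$, and each reflection is applied to $b$ as soon as it is formed. The function name `householder_ls` and the sample data are assumptions of this example.

```python
import math

def householder_ls(A, b):
    """Return (x, rho): LS solution and residual norm via Householder QR of A."""
    R = [row[:] for row in A]
    c = list(b)
    m, n = len(R), len(R[0])
    for j in range(n):
        x = [R[i][j] for i in range(j, m)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        if norm_x == 0.0:
            raise ValueError("A is rank deficient")
        v = x[:]
        v[0] += math.copysign(norm_x, x[0]) # sign chosen to avoid cancellation
        beta = 2.0 / sum(vi * vi for vi in v)
        for k in range(j, n):               # apply I - beta v v^T to A(j:m, j:n)
            w = beta * sum(v[i] * R[j + i][k] for i in range(m - j))
            for i in range(m - j):
                R[j + i][k] -= w * v[i]
        w = beta * sum(v[i] * c[j + i] for i in range(m - j))
        for i in range(m - j):              # and to b(j:m)
            c[j + i] -= w * v[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution on R(1:n,1:n) x = c(1:n)
        x[i] = (c[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    rho = math.sqrt(sum(ci * ci for ci in c[n:]))   # rho_LS = || d ||_2
    return x, rho

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 4.0]
x, rho = householder_ls(A, b)
```

The trailing $m - n$ entries of the transformed right-hand side give the minimum residual norm directly, so no extra pass over $A$ is needed to obtain $\rho_{LS}$.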

It can be shown that the computed $\hat{x}_{LS}$ solves

\[
\min \| (A + \delta A)x - (b + \delta b) \|_2 \tag{5.3.4}
\]

where

\[
\| \delta A \|_F \le (6m - 3n + 41)\,n\,u\,\| A \|_F + O(u^2) \tag{5.3.5}
\]

and

\[
\| \delta b \|_2 \le (6m - 3n + 40)\,n\,u\,\| b \|_2 + O(u^2). \tag{5.3.6}
\]

These inequalities are established in Lawson and Hanson (1974, p.90ff) and
show that $\hat{x}_{LS}$ satisfies a "nearby" LS problem. (We cannot address the
relative error in $\hat{x}_{LS}$ without an LS perturbation theory, to be discussed
shortly.) We mention that similar results hold if Givens QR is used.


5.3.4 Breakdown in Near-Rank Deficient Case 


Like the method of normal equations, the Householder method for solving
the LS problem breaks down in the back substitution phase if $\mathrm{rank}(A) < n$.
Numerically, trouble can be expected whenever $\kappa_2(A) = \kappa_2(R) \approx 1/u$.
This is in contrast to the normal equations approach, where completion
of the Cholesky factorization becomes problematical once $\kappa_2(A)$ is in the
neighborhood of $1/\sqrt{u}$. (See Example 5.3.2.) Hence the claim in Lawson
and Hanson (1974, 126-127) that for a fixed machine precision, a wider
class of LS problems can be solved using Householder orthogonalization.


5.3.5 A Note on the MGS Approach 


In principle, MGS computes the thin QR factorization $A = Q_1 R_1$. This is
enough to solve the full rank LS problem because it transforms the normal
equations $(A^T A)x = A^T b$ to the upper triangular system $R_1 x = Q_1^T b$.
But an analysis of this approach when $Q_1^T b$ is explicitly formed
introduces a $\kappa_2(A)^2$ term. This is because the computed factor $\hat{Q}_1$ satisfies
$\| \hat{Q}_1^T \hat{Q}_1 - I_n \|_2 \approx u\,\kappa_2(A)$ as we mentioned in §5.2.9.
However, if MGS is applied to the augmented matrix

\[
A_+ = [\,A \;\; b\,] = [\,Q_1 \;\; q_{n+1}\,] \begin{bmatrix} R_1 & z \\ 0 & \rho \end{bmatrix},
\]

then $z = Q_1^T b$. Computing $Q_1^T b$ in this fashion and solving $R_1 x_{LS} = z$
produces an LS solution $\hat{x}_{LS}$ that is "just as good" as the Householder QR
method. That is to say, a result of the form (5.3.4)-(5.3.6) applies. See
Björck and Paige (1992).

It should be noted that the MGS method is slightly more expensive
than Householder QR because it always manipulates m-vectors whereas
the latter procedure deals with ever shorter vectors.
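
A sketch of the augmented-matrix idea follows: MGS is run on the columns of $[A\ b]$, so that the last column of the triangular factor delivers $z$ without ever forming $Q_1^T b$ as a separate product. The function name `mgs_ls` and the sample data are assumptions made for illustration.

```python
import math

def mgs_ls(A, b):
    """LS solution via MGS applied to the augmented matrix [A b]; returns (x, rho)."""
    m, n = len(A), len(A[0])
    # Columns of [A b]; the last column is b.
    cols = [[A[i][j] for i in range(m)] for j in range(n)] + [list(b)]
    R = [[0.0] * (n + 1) for _ in range(n)]
    for k in range(n):
        R[k][k] = math.sqrt(sum(t * t for t in cols[k]))
        q = [t / R[k][k] for t in cols[k]]
        for j in range(k + 1, n + 1):       # includes the b column, which yields z
            R[k][j] = sum(qi * t for qi, t in zip(q, cols[j]))
            cols[j] = [t - R[k][j] * qi for t, qi in zip(cols[j], q)]
    z = [R[k][n] for k in range(n)]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # solve R1 x = z by back substitution
        x[i] = (z[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    rho = math.sqrt(sum(t * t for t in cols[n]))    # what remains of b is the residual
    return x, rho

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 4.0]
x, rho = mgs_ls(A, b)
```

Treating $b$ as an extra column means it is orthogonalized against each $q_k$ in the same sequential manner as the columns of $A$, which is the point of the augmented formulation.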


5.3.6 Fast Givens LS Solver 


The LS problem can also be solved using fast Givens transformations.
Suppose $M^T M = D$ is diagonal and

\[
M^T A = \begin{bmatrix} S_1 \\ 0 \end{bmatrix}
\begin{array}{l} n \\ m-n \end{array}
\]

is upper triangular. If

\[
M^T b = \begin{bmatrix} c \\ d \end{bmatrix}
\begin{array}{l} n \\ m-n \end{array}
\]

then

\[
\| Ax - b \|_2^2 = \| D^{-1/2} M^T (Ax - b) \|_2^2
= \left\| D^{-1/2} \left( \begin{bmatrix} S_1 \\ 0 \end{bmatrix} x - \begin{bmatrix} c \\ d \end{bmatrix} \right) \right\|_2^2
\]

for any $x \in \mathbb{R}^n$. Clearly, $x_{LS}$ is obtained by solving the nonsingular upper
triangular system $S_1 x = c$.

The computed solution $\hat{x}_{LS}$ obtained in this fashion can be shown to
solve a nearby LS problem in the sense of (5.3.4)-(5.3.6). This may seem
surprising since large numbers can arise during the calculation. An entry
in the scaling matrix $D$ can double in magnitude after a single fast Givens
update. However, largeness in $D$ must be exactly compensated for by
largeness in $M$, since $M D^{-1/2}$ is orthogonal at all stages of the computation.
It is this phenomenon that enables one to push through a favorable error
analysis.


5.3.7 The Sensitivity of the LS Problem 


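We now develop a perturbation theory that assists in the comparison of
the normal equations and QR approaches to the LS problem. The theorem
below examines how the LS solution and its residual are affected by changes
in $A$ and $b$. In so doing, the condition of the LS problem is identified.
Two easily established facts are required in the analysis:

\[
\begin{aligned}
\| A \|_2 \, \| (A^T A)^{-1} A^T \|_2 &= \kappa_2(A) \\
\| A \|_2^2 \, \| (A^T A)^{-1} \|_2 &= \kappa_2(A)^2
\end{aligned} \tag{5.3.7}
\]

These equations can be verified using the SVD.

Theorem 5.3.1 Suppose $x$, $r$, $\hat{x}$, and $\hat{r}$ satisfy

\[
\begin{aligned}
\| Ax - b \|_2 &= \min, & r &= b - Ax \\
\| (A + \delta A)\hat{x} - (b + \delta b) \|_2 &= \min, & \hat{r} &= (b + \delta b) - (A + \delta A)\hat{x}
\end{aligned}
\]

where $A$ and $\delta A$ are in $\mathbb{R}^{m\times n}$ with $m \ge n$ and $0 \ne b$ and $\delta b$ are in $\mathbb{R}^m$. If

\[
\epsilon = \max\left\{ \frac{\| \delta A \|_2}{\| A \|_2}, \frac{\| \delta b \|_2}{\| b \|_2} \right\} < \frac{\sigma_n(A)}{\sigma_1(A)}
\]

and

\[
\sin(\theta) = \frac{\rho_{LS}}{\| b \|_2} \ne 1
\]

where $\rho_{LS} = \| A x_{LS} - b \|_2$, then

\[
\frac{\| \hat{x} - x \|_2}{\| x \|_2} \le \epsilon \left\{ \frac{2\kappa_2(A)}{\cos(\theta)} + \tan(\theta)\kappa_2(A)^2 \right\} + O(\epsilon^2) \tag{5.3.8}
\]

\[
\frac{\| \hat{r} - r \|_2}{\| b \|_2} \le \epsilon\,(1 + 2\kappa_2(A)) \min(1, m - n) + O(\epsilon^2). \tag{5.3.9}
\]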
2 
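Proof. Let $E$ and $f$ be defined by $E = \delta A/\epsilon$ and $f = \delta b/\epsilon$. By hypothesis
$\| \delta A \|_2 < \sigma_n(A)$ and so by Theorem 2.5.2 we have $\mathrm{rank}(A + tE) = n$ for
all $t \in [0, \epsilon]$. It follows that the solution $x(t)$ to

\[
(A + tE)^T (A + tE)\,x(t) = (A + tE)^T (b + tf) \tag{5.3.10}
\]

is continuously differentiable for all $t \in [0, \epsilon]$. Since $x = x(0)$ and $\hat{x} = x(\epsilon)$,
we have

\[
\hat{x} = x + \epsilon\,\dot{x}(0) + O(\epsilon^2).
\]

The assumptions $b \ne 0$ and $\sin(\theta) \ne 1$ ensure that $x$ is nonzero and so

\[
\frac{\| \hat{x} - x \|_2}{\| x \|_2} = \epsilon \frac{\| \dot{x}(0) \|_2}{\| x \|_2} + O(\epsilon^2). \tag{5.3.11}
\]

In order to bound $\| \dot{x}(0) \|_2$ we differentiate (5.3.10) and set $t = 0$ in the
result. This gives

\[
E^T A x + A^T E x + A^T A\,\dot{x}(0) = A^T f + E^T b
\]

i.e.,

\[
\dot{x}(0) = (A^T A)^{-1} A^T (f - Ex) + (A^T A)^{-1} E^T r. \tag{5.3.12}
\]

By substituting this result into (5.3.11), taking norms, and using the easily
verified inequalities $\| f \|_2 \le \| b \|_2$ and $\| E \|_2 \le \| A \|_2$ we obtain

\[
\frac{\| \hat{x} - x \|_2}{\| x \|_2} \le \epsilon \left\{ \| A \|_2 \| (A^T A)^{-1} A^T \|_2 \left( \frac{\| b \|_2}{\| A \|_2 \| x \|_2} + 1 \right)
+ \frac{\rho_{LS}}{\| A \|_2 \| x \|_2} \| A \|_2^2 \| (A^T A)^{-1} \|_2 \right\} + O(\epsilon^2).
\]

Since $A^T(Ax - b) = 0$, $Ax$ is orthogonal to $Ax - b$ and so

\[
\| b - Ax \|_2^2 + \| Ax \|_2^2 = \| b \|_2^2.
\]

Thus,

\[
\| A \|_2^2 \| x \|_2^2 \ge \| b \|_2^2 - \rho_{LS}^2
\]

and so by using (5.3.7)

\[
\frac{\| \hat{x} - x \|_2}{\| x \|_2} \le \epsilon \left\{ \kappa_2(A) \left( \frac{1}{\cos(\theta)} + 1 \right) + \kappa_2(A)^2 \frac{\sin(\theta)}{\cos(\theta)} \right\} + O(\epsilon^2)
\]

thereby establishing (5.3.8).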


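To prove (5.3.9), we define the differentiable vector function $r(t)$ by

\[
r(t) = (b + tf) - (A + tE)\,x(t)
\]

and observe that $r = r(0)$ and $\hat{r} = r(\epsilon)$. Using (5.3.12) it can be shown
that

\[
\dot{r}(0) = (I - A(A^T A)^{-1} A^T)(f - Ex) - A(A^T A)^{-1} E^T r.
\]

Since $\| \hat{r} - r \|_2 = \epsilon \| \dot{r}(0) \|_2 + O(\epsilon^2)$ we have

\[
\begin{aligned}
\frac{\| \hat{r} - r \|_2}{\| b \|_2} &= \epsilon \frac{\| \dot{r}(0) \|_2}{\| b \|_2} + O(\epsilon^2) \\
&\le \epsilon \left\{ \| I - A(A^T A)^{-1} A^T \|_2 \left( 1 + \frac{\| A \|_2 \| x \|_2}{\| b \|_2} \right)
+ \| A(A^T A)^{-1} \|_2 \| A \|_2 \frac{\rho_{LS}}{\| b \|_2} \right\} + O(\epsilon^2).
\end{aligned}
\]

Inequality (5.3.9) now follows because

\[
\| A \|_2 \| x \|_2 = \| A \|_2 \| A^{+} b \|_2 \le \kappa_2(A) \| b \|_2,
\]

\[
\rho_{LS} = \| (I - A(A^T A)^{-1} A^T) b \|_2 \le \| I - A(A^T A)^{-1} A^T \|_2 \| b \|_2
\]

and

\[
\| I - A(A^T A)^{-1} A^T \|_2 = \min(m - n, 1). \qquad \Box
\]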


An interesting feature of the upper bound in (5.3.8) is the factor

\[
\tan(\theta)\,\kappa_2(A)^2 = \frac{\rho_{LS}}{\sqrt{\| b \|_2^2 - \rho_{LS}^2}}\,\kappa_2(A)^2.
\]

Thus, in nonzero residual problems it is the square of the condition that
measures the sensitivity of $x_{LS}$. In contrast, residual sensitivity depends
just linearly on $\kappa_2(A)$. These dependencies are confirmed by Example 5.3.1.


5.3.8 Normal Equations Versus QR 


It is instructive to compare the normal equation and QR approaches to the
LS problem. Recall the following main points from our discussion:


* The sensitivity of the LS solution is roughly proportional to the
  quantity $\kappa_2(A) + \rho_{LS}\,\kappa_2(A)^2$.

* The method of normal equations produces an $\hat{x}_{LS}$ whose relative error
  depends on the square of the condition.

* The QR approach (Householder, Givens, careful MGS) solves a nearby
  LS problem and therefore produces a solution that has a relative error
  approximately given by $u(\kappa_2(A) + \rho_{LS}\,\kappa_2(A)^2)$.

Thus, we may conclude that if $\rho_{LS}$ is small and $\kappa_2(A)$ is large, then the
method of normal equations does not solve a nearby problem and will
usually render an LS solution that is less accurate than a stable QR approach.
Conversely, the two methods produce comparably inaccurate results when
applied to large residual, ill-conditioned problems.

Finally, we mention two other factors that figure in the debate about
QR versus normal equations:


* The normal equations approach involves about half of the arithmetic
  when $m \gg n$ and does not require as much storage.

* QR approaches are applicable to a wider class of matrices because
  the Cholesky process applied to $A^T A$ breaks down "before" the back
  substitution process on $Q^T A = R$.


At the very minimum, this discussion should convince you how difficult it
can be to choose the "right" algorithm!


Problems


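P5.3.1 Assume $A^T A x = A^T b$, $(A^T A + F)\hat{x} = A^T b$, and $2\| F \|_2 \le \sigma_n(A)^2$. Show that
if $r = b - Ax$ and $\hat{r} = b - A\hat{x}$, then $\hat{r} - r = A(A^T A + F)^{-1} F x$ and

\[
\| \hat{r} - r \|_2 \le 2\kappa_2(A) \frac{\| F \|_2}{\| A \|_2} \| x \|_2.
\]


P5.3.2 Assume that $A^T A x = A^T b$ and that $A^T A \hat{x} = A^T b + f$ where
$\| f \|_2 \le c\,u \| A^T \|_2 \| b \|_2$ and $A$ has full column rank. Show that

\[
\frac{\| x - \hat{x} \|_2}{\| x \|_2} \le c\,u\,\kappa_2(A)^2 \frac{\| A^T \|_2 \| b \|_2}{\| A^T b \|_2}.
\]


P5.3.3 Let $A \in \mathbb{R}^{m\times n}$ with $m > n$ and $y \in \mathbb{R}^m$ and define $\bar{A} = [A\ y] \in \mathbb{R}^{m\times(n+1)}$.
Show that $\sigma_1(\bar{A}) \ge \sigma_1(A)$ and $\sigma_{n+1}(\bar{A}) \le \sigma_n(A)$. Thus, the condition grows if a column
is added to a matrix.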


P5.3.4 Let $A \in \mathbb{R}^{m\times n}$ ($m \ge n$), $w \in \mathbb{R}^n$, and define

\[
B = \begin{bmatrix} A \\ w^T \end{bmatrix}.
\]

Show that $\sigma_n(B) \ge \sigma_n(A)$ and $\sigma_1(B) \le \sqrt{\| A \|_2^2 + \| w \|_2^2}$. Thus, the condition of a
matrix may increase or decrease if a row is added.

P5.3.5 (Cline 1973) Suppose that $A \in \mathbb{R}^{m\times n}$ has rank $n$ and that Gaussian elimination
with partial pivoting is used to compute the factorization $PA = LU$, where $L \in \mathbb{R}^{m\times n}$ is
unit lower triangular, $U \in \mathbb{R}^{n\times n}$ is upper triangular, and $P \in \mathbb{R}^{m\times m}$ is a permutation.
Explain how the decomposition in P5.2.5 can be used to find a vector $z \in \mathbb{R}^n$ such that
$\| Lz - Pb \|_2$ is minimized. Show that if $Ux = z$, then $\| Ax - b \|_2$ is minimum. Show
that this method of solving the LS problem is more efficient than Householder QR from
the flop point of view whenever $m \le 5n/3$.
P5.3.6 The matrix $C = (A^TA)^{-1}$, where rank(A) = n, arises in many statistical
applications and is known as the variance-covariance matrix. Assume that the factorization
A = QR is available. (a) Show $C = (R^TR)^{-1}$. (b) Give an algorithm for computing the
diagonal of C that requires $n^3/3$ flops. (c) Show that

$$R = \begin{bmatrix} \alpha & v^T \\ 0 & S \end{bmatrix} \quad\Longrightarrow\quad C = (R^TR)^{-1} = \begin{bmatrix} (1 + v^TC_1v)/\alpha^2 & -v^TC_1/\alpha \\ -C_1v/\alpha & C_1 \end{bmatrix}$$

where $C_1 = (S^TS)^{-1}$. (d) Using (c), give an algorithm that overwrites the upper
triangular portion of R with the upper triangular portion of C. Your algorithm should
require $2n^3/3$ flops.

P5.3.7 Suppose $A \in \mathbb{R}^{n\times n}$ is symmetric and that r = b - Ax where $r, b, x \in \mathbb{R}^n$ and
x is nonzero. Show how to compute a symmetric $E \in \mathbb{R}^{n\times n}$ with minimal Frobenius
norm so that (A + E)x = b. Hint: Use the QR factorization of [x, r] and note that
$Ex = r \Rightarrow (Q^TEQ)(Q^Tx) = Q^Tr$.

P5.3.8 Show how to compute one nearest circulant matrix to a given Toeplitz matrix. 
Measure distance with the Frobenius norm.

5.4 Other Orthogonal Factorizations


If A is rank deficient, then the QR factorization need not give a basis for
ran(A). This problem can be corrected by computing the QR factorization
of a column-permuted version of A, i.e., $A\Pi = QR$ where $\Pi$ is a
permutation.

The "data" in A can be compressed further if we permit right multiplication
by a general orthogonal matrix Z:

$$Q^TAZ = T.$$

There are interesting choices for Q and Z and these, together with the
column pivoted QR factorization, are discussed in this section.


5.4.1 Rank Deficiency: QR with Column Pivoting


If $A \in \mathbb{R}^{m\times n}$ and rank(A) < n, then the QR factorization does not
necessarily produce an orthonormal basis for ran(A). For example, if A has
three columns and

$$A = [a_1, a_2, a_3] = [q_1, q_2, q_3]\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

is its QR factorization, then rank(A) = 2 but ran(A) does not equal any of
the subspaces span{$q_1, q_2$}, span{$q_1, q_3$}, or span{$q_2, q_3$}.

Fortunately, the Householder QR factorization procedure (Algorithm
5.2.1) can be modified in a simple way to produce an orthonormal basis for
ran(A). The modified algorithm computes the factorization

$$Q^TA\Pi = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix} \tag{5.4.1}$$

(row blocks of size r and m-r, column blocks of size r and n-r)
where r = rank(A), Q is orthogonal, $R_{11}$ is upper triangular and
nonsingular, and $\Pi$ is a permutation. If we have the column partitionings
$A\Pi = [a_{c_1}, \ldots, a_{c_n}]$ and $Q = [q_1, \ldots, q_m]$, then for k = 1:n we have

$$a_{c_k} = \sum_{i=1}^{\min(r,k)} r_{ik}q_i \in \mathrm{span}\{q_1, \ldots, q_r\}$$

implying

$$\mathrm{ran}(A) = \mathrm{span}\{q_1, \ldots, q_r\}.$$


The matrices Q and $\Pi$ are products of Householder matrices and
interchange matrices respectively. Assume for some k that we have computed
Householder matrices $H_1, \ldots, H_{k-1}$ and permutations $\Pi_1, \ldots, \Pi_{k-1}$ such
that

$$(H_{k-1} \cdots H_1)A(\Pi_1 \cdots \Pi_{k-1}) = R^{(k-1)} = \begin{bmatrix} R_{11}^{(k-1)} & R_{12}^{(k-1)} \\ 0 & R_{22}^{(k-1)} \end{bmatrix} \tag{5.4.2}$$

(row blocks of size k-1 and m-k+1, column blocks of size k-1 and n-k+1)
where $R_{11}^{(k-1)}$ is a nonsingular and upper triangular matrix. Now suppose
that

$$R_{22}^{(k-1)} = [z_k^{(k-1)}, \ldots, z_n^{(k-1)}]$$

is a column partitioning and let $p \ge k$ be the smallest index such that

$$\|z_p^{(k-1)}\|_2 = \max\{\|z_k^{(k-1)}\|_2, \ldots, \|z_n^{(k-1)}\|_2\}. \tag{5.4.3}$$

Note that if k-1 = rank(A), then this maximum is zero and we are finished.
Otherwise, let $\Pi_k$ be the n-by-n identity with columns p and k interchanged
and determine a Householder matrix $H_k$ such that if $R^{(k)} = H_kR^{(k-1)}\Pi_k$,
then $R^{(k)}(k+1:m, k) = 0$. In other words, $\Pi_k$ moves the largest column in
$R_{22}^{(k-1)}$ to the lead position and $H_k$ zeroes all of its subdiagonal components.

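The column norms do not have to be recomputed at each stage if we
exploit the property

$$Q^Tz = \begin{bmatrix} \alpha \\ w \end{bmatrix} \quad\Longrightarrow\quad \|w\|_2^2 = \|z\|_2^2 - \alpha^2,$$

(with $\alpha$ a scalar and $w \in \mathbb{R}^{s-1}$)
which holds for any orthogonal matrix $Q \in \mathbb{R}^{s\times s}$. This reduces the overhead
associated with column pivoting from $O(mn^2)$ flops to $O(mn)$ flops because
we can get the new column norms by updating the old column norms, e.g.,

$$\|z_j^{(k)}\|_2^2 = \|z_j^{(k-1)}\|_2^2 - r_{kj}^2.$$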


Combining all of the above we obtain the following algorithm established
by Businger and Golub (1965):


Algorithm 5.4.1 (Householder QR With Column Pivoting) Given
$A \in \mathbb{R}^{m\times n}$ with $m \ge n$, the following algorithm computes r = rank(A)
and the factorization (5.4.1) with $Q = H_1 \cdots H_r$ and $\Pi = \Pi_1 \cdots \Pi_r$. The
upper triangular part of A is overwritten by the upper triangular part of
R and components j+1:m of the jth Householder vector are stored in
A(j+1:m, j). The permutation $\Pi$ is encoded in an integer vector piv. In
particular, $\Pi_j$ is the identity with rows j and piv(j) interchanged.

    for j = 1:n
        c(j) = A(1:m, j)^T A(1:m, j)
    end
    r = 0; τ = max{c(1), ..., c(n)}
    Find smallest k with 1 ≤ k ≤ n so c(k) = τ
    while τ > 0
        r = r + 1
        piv(r) = k; A(1:m, r) ↔ A(1:m, k); c(r) ↔ c(k)
        [v, β] = house(A(r:m, r))
        A(r:m, r:n) = (I_{m-r+1} - β v v^T) A(r:m, r:n)
        A(r+1:m, r) = v(2:m-r+1)
        for i = r+1:n
            c(i) = c(i) - A(r, i)^2
        end
        if r < n
            τ = max{c(r+1), ..., c(n)}
            Find smallest k with r+1 ≤ k ≤ n so c(k) = τ.
        else
            τ = 0
        end
    end


This algorithm requires $4mnr - 2r^2(m+n) + 4r^3/3$ flops where r = rank(A).
As with the nonpivoting procedure, Algorithm 5.2.1, the orthogonal matrix
Q is stored in factored form in the subdiagonal portion of A.
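
The steps of Algorithm 5.4.1 can be sketched in plain Python as follows. This is an illustrative sketch rather than the book's implementation: the function name `qr_column_pivoting` and the relative cutoff `tol` (which stands in for the exact test τ > 0, since roundoff rarely leaves τ exactly zero) are assumptions, the Householder vectors are not stored in the subdiagonal, and `piv` records the final column order instead of the individual interchanges.

```python
import math

def qr_column_pivoting(A, tol=1e-12):
    """Householder QR with column pivoting on a list-of-rows matrix (m >= n).

    Returns (R, piv, r): R is the reduced copy of A, piv[j] is the original
    index of the column now in position j, and r is the detected rank.
    tol is an assumed relative cutoff replacing the exact test tau > 0.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    piv = list(range(n))
    c = [sum(R[i][j] ** 2 for i in range(m)) for j in range(n)]  # squared column norms
    cutoff = tol * max(c)
    r = 0
    while r < n:
        k = max(range(r, n), key=lambda j: c[j])   # smallest index attaining the max
        if c[k] <= cutoff:
            break
        for row in R:                               # interchange columns r and k
            row[r], row[k] = row[k], row[r]
        c[r], c[k] = c[k], c[r]
        piv[r], piv[k] = piv[k], piv[r]
        x = [R[i][r] for i in range(r, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha                               # H x = alpha * e_1
        vv = sum(t * t for t in v)
        if vv > 0.0:
            for j in range(r, n):                   # apply H = I - 2 v v^T / (v^T v)
                s = 2.0 * sum(v[i] * R[r + i][j] for i in range(m - r)) / vv
                for i in range(m - r):
                    R[r + i][j] -= s * v[i]
        for j in range(r + 1, n):                   # downdate the squared norms
            c[j] -= R[r][j] ** 2
        r += 1
    return R, piv, r

# Third column equals (first column) + 2 * (second column), so the rank is 2.
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0, 3.0]]
R, piv, r = qr_column_pivoting(A)
```

The norm downdating step can lose accuracy through cancellation, which is one reason the rank decision is made against a tolerance here rather than against zero.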


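Example 5.4.1 If Algorithm 5.4.1 is applied to

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix},$$

then $\Pi = [e_3\ e_2\ e_1]$ and to three significant digits we obtain

$$A\Pi = QR = \begin{bmatrix} -.182 & -.816 & .514 & .191 \\ -.365 & -.408 & -.827 & .128 \\ -.548 & .000 & .113 & -.829 \\ -.730 & .408 & .200 & .510 \end{bmatrix}\begin{bmatrix} -16.4 & -14.6 & -12.8 \\ 0 & .816 & 1.63 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$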


5.4.2 Complete Orthogonal Decompositions


The matrix R produced by Algorithm 5.4.1 can be further reduced if it
is post-multiplied by an appropriate sequence of Householder matrices. In
particular, we can use Algorithm 5.2.1 to compute

$$Z_r \cdots Z_1\begin{bmatrix} R_{11}^T \\ R_{12}^T \end{bmatrix} = \begin{bmatrix} T_{11}^T \\ 0 \end{bmatrix} \tag{5.4.4}$$

(row blocks of size r and n-r)
where the $Z_i$ are Householder transformations and $T_{11}^T$ is upper triangular.
It then follows that

$$Q^TAZ = T = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix} \tag{5.4.5}$$

(row blocks of size r and m-r, column blocks of size r and n-r)
where $Z = \Pi Z_1 \cdots Z_r$. We refer to any decomposition of this form as a
complete orthogonal decomposition. Note that null(A) = ran(Z(1:n, r+1:n)).
See P5.2.5 for details about the exploitation of structure in (5.4.4).
5.4.3  Bidiagonalization 


Suppose $A \in \mathbb{R}^{m\times n}$ and $m \ge n$. We next show how to compute orthogonal
$U_B$ (m-by-m) and $V_B$ (n-by-n) such that

$$U_B^TAV_B = \begin{bmatrix} d_1 & f_1 & 0 & \cdots & 0 \\ 0 & d_2 & f_2 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & d_{n-1} & f_{n-1} \\ 0 & \cdots & & 0 & d_n \\ & & 0 & & \end{bmatrix} \tag{5.4.6}$$

where the final block row is the (m-n)-by-n zero matrix.
$U_B = U_1 \cdots U_n$ and $V_B = V_1 \cdots V_{n-2}$ can each be determined as a product
of Householder matrices:

    x x x x        x x x x        x x 0 0        x x 0 0
    x x x x  U_1   0 x x x  V_1   0 x x x  U_2   0 x x x
    x x x x  -->   0 x x x  -->   0 x x x  -->   0 0 x x
    x x x x        0 x x x        0 x x x        0 0 x x
    x x x x        0 x x x        0 x x x        0 0 x x

           x x 0 0        x x 0 0        x x 0 0
     V_2   0 x x 0  U_3   0 x x 0  U_4   0 x x 0
     -->   0 0 x x  -->   0 0 x x  -->   0 0 x x
           0 0 x x        0 0 0 x        0 0 0 x
           0 0 x x        0 0 0 x        0 0 0 0

In general, $U_k$ introduces zeros into the kth column, while $V_k$ zeros the
appropriate entries in row k. Overall we have:


Algorithm 5.4.2 (Householder Bidiagonalization) Given $A \in \mathbb{R}^{m\times n}$
with $m \ge n$, the following algorithm overwrites A with $U_B^TAV_B = B$ where
B is upper bidiagonal and $U_B = U_1 \cdots U_n$ and $V_B = V_1 \cdots V_{n-2}$. The
essential part of $U_j$'s Householder vector is stored in A(j+1:m, j) and the
essential part of $V_j$'s Householder vector is stored in A(j, j+2:n).


    for j = 1:n
        [v, β] = house(A(j:m, j))
        A(j:m, j:n) = (I_{m-j+1} - β v v^T) A(j:m, j:n)
        A(j+1:m, j) = v(2:m-j+1)
        if j ≤ n-2
            [v, β] = house(A(j, j+1:n)^T)
            A(j:m, j+1:n) = A(j:m, j+1:n)(I_{n-j} - β v v^T)
            A(j, j+2:n) = v(2:n-j)^T
        end
    end


This algorithm requires $4mn^2 - 4n^3/3$ flops. Such a technique is used in
Golub and Kahan (1965), where bidiagonalization is first described. If the
matrices $U_B$ and $V_B$ are explicitly desired, then they can be accumulated
in $4m^2n - 4n^3/3$ and $4n^3/3$ flops, respectively. The bidiagonalization of A
is related to the tridiagonalization of $A^TA$. See §8.2.1.
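
The alternating left and right Householder sweeps of Algorithm 5.4.2 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the names `householder_vector` and `bidiagonalize` are hypothetical, the reflectors are applied explicitly and not stored in the zeroed positions, and the sign convention of the reflector may differ from that of the book's `house` function, so entries of B can differ in sign.

```python
import math

def householder_vector(x):
    """Return v such that (I - 2 v v^T / v^T v) x is a multiple of e_1."""
    alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
    v = x[:]
    v[0] -= alpha
    return v

def bidiagonalize(A):
    """Return B = U_B^T A V_B (upper bidiagonal) for a list-of-rows A, m >= n."""
    m, n = len(A), len(A[0])
    B = [row[:] for row in A]
    for j in range(n):
        # U_j: zero the entries below the diagonal in column j.
        v = householder_vector([B[i][j] for i in range(j, m)])
        vv = sum(t * t for t in v)
        if vv > 0.0:
            for k in range(j, n):
                s = 2.0 * sum(v[i] * B[j + i][k] for i in range(m - j)) / vv
                for i in range(m - j):
                    B[j + i][k] -= s * v[i]
        if j < n - 2:
            # V_j: zero the entries to the right of the superdiagonal in row j.
            v = householder_vector(B[j][j + 1:])
            vv = sum(t * t for t in v)
            if vv > 0.0:
                for i in range(j, m):
                    s = 2.0 * sum(v[k] * B[i][j + 1 + k] for k in range(n - j - 1)) / vv
                    for k in range(n - j - 1):
                        B[i][j + 1 + k] -= s * v[k]
    return B

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
B = bidiagonalize(A)
```

Because orthogonal transformations preserve the Frobenius norm, the sum of squares of the entries of B should match that of A, which gives a cheap sanity check on any implementation.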


Example 5.4.2 If Algorithm 5.4.2 is applied to

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix},$$


then to three significant digits we obtain 


o [85 25. 9] — [im ow oo 
B= 0 T 0 ' 0 Ve =| 900 -.567 -.745 
a 0 0 0.00 -.745 667 


5.4.4 R-Bidiagonalization 


A faster method of bidiagonalizing when $m \gg n$ results if we upper
triangularize A first before applying Algorithm 5.4.2. In particular, suppose we
compute an orthogonal $Q \in \mathbb{R}^{m\times m}$ such that

$$Q^TA = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

is upper triangular. We then bidiagonalize the square matrix $R_1$,

$$U_R^TR_1V_B = B_1.$$

Here $U_R$ and $V_B$ are n-by-n orthogonal and $B_1$ is n-by-n upper bidiagonal.
If $U_B = Q\,\mathrm{diag}(U_R, I_{m-n})$ then

$$U_B^TAV_B = \begin{bmatrix} B_1 \\ 0 \end{bmatrix} \equiv B$$

is a bidiagonalization of A.

The idea of computing the bidiagonalization in this manner is mentioned
in Lawson and Hanson (1974, p.119) and more fully analyzed in Chan
(1982a). We refer to this method as R-bidiagonalization. By comparing its
flop count ($2mn^2 + 2n^3$) with that for Algorithm 5.4.2 ($4mn^2 - 4n^3/3$) we see
that it involves fewer computations (approximately) whenever $m \ge 5n/3$.
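
The crossover at m = 5n/3 follows directly from the two flop counts quoted above; the short sketch below simply evaluates them. The function names are hypothetical, and the counts are the approximate leading-order figures from the text, not measured costs.

```python
def householder_bidiag_flops(m, n):
    """Approximate flop count of Algorithm 5.4.2: 4mn^2 - 4n^3/3."""
    return 4 * m * n**2 - 4 * n**3 / 3

def r_bidiag_flops(m, n):
    """Approximate flop count of R-bidiagonalization: 2mn^2 + 2n^3."""
    return 2 * m * n**2 + 2 * n**3

def cheaper_method(m, n):
    """Name the method with the smaller approximate flop count."""
    if r_bidiag_flops(m, n) < householder_bidiag_flops(m, n):
        return "R-bidiagonalization"
    return "Householder bidiagonalization"

# Setting 4mn^2 - 4n^3/3 = 2mn^2 + 2n^3 gives 2mn^2 = 10n^3/3, i.e. m = 5n/3.
for m, n in [(3, 3), (5, 3), (10, 3)]:
    print(m, n, householder_bidiag_flops(m, n), r_bidiag_flops(m, n))
```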


5.4.5 The SVD and its Computation 


Once the bidiagonalization of A has been achieved, the next step in the
Golub-Reinsch SVD algorithm is to zero the superdiagonal elements in B.
This is an iterative process and is accomplished by an algorithm due to
Golub and Kahan (1965). Unfortunately, we must defer our discussion of
this iteration until §8.6 as it requires an understanding of the symmetric
eigenvalue problem. Suffice it to say here that it computes orthogonal
matrices $U_\Sigma$ and $V_\Sigma$ such that

$$U_\Sigma^TBV_\Sigma = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{m\times n}.$$

By defining $U = U_BU_\Sigma$ and $V = V_BV_\Sigma$ we see that $U^TAV = \Sigma$ is the SVD
of A. The flop counts associated with this portion of the algorithm depend
upon "how much" of the SVD is required. For example, when solving the
LS problem, $U^T$ need never be explicitly formed but merely applied to b
as it is developed. In other applications, only the matrix $U_1 = U(:, 1:n)$
is required. Altogether there are six possibilities and the total amount of
work required by the SVD algorithm in each case is summarized in the
table below. Because of the two possible bidiagonalization schemes, there
are two columns of flop counts. If the bidiagonalization is achieved via
Algorithm 5.4.2, the Golub-Reinsch (1970) SVD algorithm results, while if
R-bidiagonalization is invoked we obtain the R-SVD algorithm detailed in
Chan (1982a). By comparing the entries in this table (which are meant only
as approximate estimates of work), we conclude that the R-SVD approach
is more efficient unless $m \approx n$.

    Required        Golub-Reinsch SVD        R-SVD
    Σ               4mn^2 - 4n^3/3           2mn^2 + 2n^3
    Σ, V            4mn^2 + 8n^3             2mn^2 + 11n^3
    Σ, U            4m^2n - 8mn^2            4m^2n + 13n^3
    Σ, U_1          14mn^2 - 2n^3            6mn^2 + 11n^3
    Σ, U, V         4m^2n + 8mn^2 + 9n^3     4m^2n + 22n^3
    Σ, U_1, V       14mn^2 + 8n^3            6mn^2 + 20n^3


Problems


P5.4.1 Suppose $A \in \mathbb{R}^{m\times n}$ with m < n. Give an algorithm for computing the
factorization

$$U^TAV = [\,B\ \ 0\,]$$

where B is an m-by-m upper bidiagonal matrix. (Hint: Obtain the form

    x x 0 0 0 0
    0 x x 0 0 0
    0 0 x x 0 0
    0 0 0 x x 0

using Householder matrices and then "chase" the (m, m+1) entry up the (m+1)st
column by applying Givens rotations from the right.)


P5.4.2 Show how to efficiently bidiagonalize an n-by-n upper triangular matrix using 
Givens rotations. 


P5.4.3 Show how to upper bidiagonalize a tridiagonal matrix $T \in \mathbb{R}^{n\times n}$ using Givens
rotations.


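P5.4.4 Let $A \in \mathbb{R}^{m\times n}$ and assume that $0 \ne v$ satisfies $\|Av\|_2 = \sigma_n(A)\|v\|_2$. Let $\Pi$
be a permutation such that if $\Pi^Tv = w$, then $|w_n| = \|w\|_\infty$. Show that if $A\Pi = QR$
is the QR factorization of $A\Pi$, then $|r_{nn}| \le \sqrt{n}\,\sigma_n(A)$. Thus, there always exists a
permutation $\Pi$ such that the QR factorization of $A\Pi$ "displays" near rank deficiency.

P5.4.5 Let $x, y \in \mathbb{R}^m$ and $Q \in \mathbb{R}^{m\times m}$ be given with Q orthogonal. Show that if

$$Q^Tx = \begin{bmatrix} \alpha \\ u \end{bmatrix}, \qquad Q^Ty = \begin{bmatrix} \beta \\ v \end{bmatrix}$$

(with $\alpha, \beta$ scalars and $u, v \in \mathbb{R}^{m-1}$), then $u^Tv = x^Ty - \alpha\beta$.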


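P5.4.6 Let $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$ be given. For any subset of A's
columns $\{a_{c_1}, \ldots, a_{c_k}\}$ define

$$\mathrm{res}[a_{c_1}, \ldots, a_{c_k}] = \min_{x \in \mathbb{R}^k}\|[a_{c_1}, \ldots, a_{c_k}]x - b\|_2.$$

Describe an alternative pivot selection procedure for Algorithm 5.4.1 such that if
$QR = A\Pi = [a_{c_1}, \ldots, a_{c_n}]$ in the final factorization, then for k = 1:n:

$$\mathrm{res}[a_{c_1}, \ldots, a_{c_k}] = \min_{i \ge k}\mathrm{res}[a_{c_1}, \ldots, a_{c_{k-1}}, a_{c_i}].$$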


Notes and References for Sec. 5.4 
Aspects of the complete orthogonal decomposition are discussed in

R.J. Hanson and C.L. Lawson (1969). “Extensions and Applications of the Householder
Algorithm for Solving Linear Least Square Problems,” Math. Comp. 23, 787-812.

P.A. Wedin (1973). “On the Almost Rank-Deficient Case of the Least Squares Problem,”
BIT 13, 344-54.

G.H. Golub and V. Pereyra (1976). “Differentiation of Pseudo-Inverses, Separable
Nonlinear Least Squares Problems and Other Tales,” in Generalized Inverses and
Applications, ed. M.Z. Nashed, Academic Press, New York, pp. 303-24.


The computation of the SVD is detailed in §8.6. But here are some of the standard
references concerned with its calculation:


G.H. Golub and W. Kahan (1965). “Calculating the Singular Values and Pseudo-Inverse
of a Matrix,” SIAM J. Num. Anal. 2, 205-24.

P.A. Businger and G.H. Golub (1969). “Algorithm 358: Singular Value Decomposition
of the Complex Matrix,” Comm. ACM 12, 564-65.

G.H. Golub and C. Reinsch (1970). “Singular Value Decomposition and Least Squares
Solutions,” Numer. Math. 14, 403-20. See also Wilkinson and Reinsch (1971, pp.
134-51).

T.F. Chan (1982). “An Improved Algorithm for Computing the Singular Value
Decomposition,” ACM Trans. Math. Soft. 8, 72-83.


QR with column pivoting was first discussed in 


P.A. Businger and G.H. Golub (1965). “Linear Least Squares Solutions by Householder
Transformations,” Numer. Math. 7, 269-76. See also Wilkinson and Reinsch (1971,
pp. 111-18).


Knowing when to stop in the algorithm is difficult. In questions of rank deficiency, it is
helpful to obtain information about the smallest singular value of the upper triangular
matrix R. This can be done using the techniques of §3.5.4 or those that are discussed in


I. Karasalo (1974). “A Criterion for Truncation of the QR Decomposition Algorithm for
the Singular Linear Least Squares Problem,” BIT 14, 156-66.

N. Anderson and I. Karasalo (1975). “On Computing Bounds for the Least Singular
Value of a Triangular Matrix,” BIT 15, 1-4.


Other aspects of rank estimation with QR are discussed in 


L.V. Foster (1986). "Rank and Null Space Calculations Using Matrix Decomposition 
without Column Interchanges,” Lin. Alg. and its Applic. 74, 47-71. 

T.F. Chan (1987). “Rank Revealing QR Factorizations,” Lin. Alg. and its Applic.
88/89, 67-82.

T.F. Chan and P. Hansen (1992). “Some Applications of the Rank Revealing QR
Factorization,” SIAM J. Sci. and Stat. Comp. 13, 727-741.

J.L. Barlow and U.B. Vemulapati (1992). “Rank Detection Methods for Sparse
Matrices,” SIAM J. Matrix Anal. Appl. 13, 1279-1297.

T-M. Hwang, W-W. Lin, and E.K. Yang (1992). “Rank-Revealing LU Factorizations,”
Lin. Alg. and its Applic. 175, 115-141.

C.H. Bischof and P.C. Hansen (1992). “A Block Algorithm for Computing Rank-
Revealing QR Factorizations,” Numerical Algorithms 2, 371-392.

S. Chandrasekaran and I.C.F. Ipsen (1994). “On Rank-Revealing Factorizations,” SIAM
J. Matrix Anal. Appl. 15, 592-622.

R.D. Fierro and P.C. Hansen (1995). “Accuracy of TSVD Solutions Computed from
Rank-Revealing Decompositions,” Numer. Math. 70, 453-472.

5.5 The Rank Deficient LS Problem


If A is rank deficient, then there are an infinite number of solutions to the
LS problem and we must resort to special techniques. These techniques
must address the difficult problem of numerical rank determination.

After some SVD preliminaries, we show how QR with column pivoting
can be used to determine a minimizer $x_B$ with the property that $Ax_B$ is a
linear combination of r = rank(A) columns. We then discuss the minimum
2-norm solution that can be obtained from the SVD.


5.5.1 The Minimum Norm Solution 


Suppose $A \in \mathbb{R}^{m\times n}$ and rank(A) = r < n. The rank deficient LS problem
has an infinite number of solutions, for if x is a minimizer and $z \in \mathrm{null}(A)$
then x + z is also a minimizer. The set of all minimizers

$$\chi = \{x \in \mathbb{R}^n : \|Ax - b\|_2 = \min\}$$

is convex, for if $x_1, x_2 \in \chi$ and $\lambda \in [0, 1]$, then

$$\|A(\lambda x_1 + (1-\lambda)x_2) - b\|_2 \le \lambda\|Ax_1 - b\|_2 + (1-\lambda)\|Ax_2 - b\|_2 = \min_{x \in \mathbb{R}^n}\|Ax - b\|_2.$$

Thus, $\lambda x_1 + (1-\lambda)x_2 \in \chi$. It follows that $\chi$ has a unique element having
minimum 2-norm and we denote this solution by $x_{LS}$. (Note that in the
full rank case, there is only one LS solution and so it must have minimal
2-norm. Thus, we are consistent with the notation in §5.3.)


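5.5.2 Complete Orthogonal Factorization and $x_{LS}$


Any complete orthogonal factorization can be used to compute $x_{LS}$. In
particular, if Q and Z are orthogonal matrices such that

$$Q^TAZ = T = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix}, \qquad r = \mathrm{rank}(A)$$

(row blocks of size r and m-r, column blocks of size r and n-r), then

$$\|Ax - b\|_2^2 = \|(Q^TAZ)Z^Tx - Q^Tb\|_2^2 = \|T_{11}w - c\|_2^2 + \|d\|_2^2$$

where

$$Z^Tx = \begin{bmatrix} w \\ y \end{bmatrix}, \qquad Q^Tb = \begin{bmatrix} c \\ d \end{bmatrix}$$

with $w, c \in \mathbb{R}^r$, $y \in \mathbb{R}^{n-r}$, and $d \in \mathbb{R}^{m-r}$.
Clearly, if x is to minimize the sum of squares, then we must have $w = T_{11}^{-1}c$.
For x to have minimal 2-norm, y must be zero, and thus,

$$x_{LS} = Z\begin{bmatrix} T_{11}^{-1}c \\ 0 \end{bmatrix}.$$

5.5.3 The SVD and the LS Problem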


Of course, the SVD is a particularly revealing complete orthogonal
decomposition. It provides a neat expression for $x_{LS}$ and the norm of the
minimum residual $\rho_{LS} = \|Ax_{LS} - b\|_2$.


Theorem 5.5.1 Suppose $U^TAV = \Sigma$ is the SVD of $A \in \mathbb{R}^{m\times n}$ with r =
rank(A). If $U = [u_1, \ldots, u_m]$ and $V = [v_1, \ldots, v_n]$ are column partitionings
and $b \in \mathbb{R}^m$, then

$$x_{LS} = \sum_{i=1}^{r}\frac{u_i^Tb}{\sigma_i}v_i \tag{5.5.1}$$

minimizes $\|Ax - b\|_2$ and has the smallest 2-norm of all minimizers. Moreover

$$\rho_{LS}^2 = \|Ax_{LS} - b\|_2^2 = \sum_{i=r+1}^{m}(u_i^Tb)^2. \tag{5.5.2}$$

Proof. For any $x \in \mathbb{R}^n$ we have:

$$\|Ax - b\|_2^2 = \|(U^TAV)(V^Tx) - U^Tb\|_2^2 = \|\Sigma\alpha - U^Tb\|_2^2 = \sum_{i=1}^{r}(\sigma_i\alpha_i - u_i^Tb)^2 + \sum_{i=r+1}^{m}(u_i^Tb)^2$$

where $\alpha = V^Tx$. Clearly, if x solves the LS problem, then $\alpha_i = (u_i^Tb/\sigma_i)$ for
i = 1:r. If we set $\alpha(r+1:n) = 0$, then the resulting x clearly has minimal
2-norm. □
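
Formulas (5.5.1) and (5.5.2) translate directly into code once the SVD factors are in hand. The sketch below assumes the singular vectors and values are supplied by the caller (no SVD is computed here); the function name `svd_ls_solution` and the column-list layout of U and V are illustrative choices, not part of the text.

```python
import math

def svd_ls_solution(U_cols, sigma, V_cols, b, r):
    """Evaluate (5.5.1) and (5.5.2).

    U_cols and V_cols are lists of the columns u_i and v_i, sigma lists the
    singular values, and r is the rank. Returns (x_LS, rho_LS squared).
    """
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    x = [0.0] * len(V_cols[0])
    for i in range(r):
        coef = dot(U_cols[i], b) / sigma[i]
        x = [xj + coef * vj for xj, vj in zip(x, V_cols[i])]
    rho2 = sum(dot(u, b) ** 2 for u in U_cols[r:])
    return x, rho2

# A = [[1, 1], [1, 1]] has rank 1 with sigma_1 = 2 and u_1 = v_1 = (1, 1)/sqrt(2).
s = 1.0 / math.sqrt(2.0)
U_cols = [[s, s], [s, -s]]
V_cols = [[s, s], [s, -s]]
x_ls, rho2 = svd_ls_solution(U_cols, [2.0, 0.0], V_cols, [1.0, 3.0], 1)
```

For this small case every minimizer satisfies x_1 + x_2 = 2, and the routine returns the one of smallest 2-norm, x = (1, 1), with squared residual 2.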


5.5.4 The Pseudo-Inverse

Note that if we define the matrix $A^+ \in \mathbb{R}^{n\times m}$ by $A^+ = V\Sigma^+U^T$ where

$$\Sigma^+ = \mathrm{diag}\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right) \in \mathbb{R}^{n\times m}, \qquad r = \mathrm{rank}(A),$$

then $x_{LS} = A^+b$ and $\rho_{LS} = \|(I - AA^+)b\|_2$. $A^+$ is referred to as the
pseudo-inverse of A. It is the unique minimal Frobenius norm solution to
the problem

$$\min_{X \in \mathbb{R}^{n\times m}}\|AX - I_m\|_F. \tag{5.5.3}$$

If rank(A) = n, then $A^+ = (A^TA)^{-1}A^T$, while if m = n = rank(A), then
$A^+ = A^{-1}$. Typically, $A^+$ is defined to be the unique matrix $X \in \mathbb{R}^{n\times m}$
that satisfies the four Moore-Penrose conditions:

    (i)  AXA = A        (iii) (AX)^T = AX
    (ii) XAX = X        (iv)  (XA)^T = XA.

These conditions amount to the requirement that $AA^+$ and $A^+A$ be orthogonal
projections onto ran(A) and ran($A^T$), respectively. Indeed, $AA^+ = U_1U_1^T$
where $U_1 = U(1:m, 1:r)$ and $A^+A = V_1V_1^T$ where $V_1 = V(1:n, 1:r)$.

5.5.5 Some Sensitivity Issues 


In §5.3 we examined the sensitivity of the full rank LS problem. The
behavior of $x_{LS}$ in this situation is summarized in Theorem 5.3.1. If we drop
the full rank assumptions then $x_{LS}$ is not even a continuous function of the
data and small changes in A and b can induce arbitrarily large changes in
$x_{LS} = A^+b$. The easiest way to see this is to consider the behavior of the
pseudo-inverse. If A and $\delta A$ are in $\mathbb{R}^{m\times n}$, then Wedin (1973) and Stewart
(1975) show that

$$\|(A + \delta A)^+ - A^+\|_F \le 2\|\delta A\|_F\max\{\|A^+\|_2^2, \|(A + \delta A)^+\|_2^2\}.$$

This inequality is a generalization of Theorem 2.3.4 in which perturbations
in the matrix inverse are bounded. However, unlike the square nonsingular
case, the upper bound does not necessarily tend to zero as $\delta A$ tends to zero.
If

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad \delta A = \begin{bmatrix} 0 & 0 \\ 0 & \epsilon \\ 0 & 0 \end{bmatrix}$$

then

$$A^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad\text{and}\quad (A + \delta A)^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1/\epsilon & 0 \end{bmatrix}$$

and $\|A^+ - (A + \delta A)^+\|_2 = 1/\epsilon$. The numerical determination of an LS
minimizer in the presence of such discontinuities is a major challenge.
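
The discontinuity can be checked numerically with a 3-by-2 matrix A whose only nonzero entry is A(1,1) = 1, perturbed by ε in the (2,2) position. The sketch below verifies the four Moore-Penrose conditions for the stated pseudo-inverses and measures the gap between them. All helper names are hypothetical, and the pseudo-inverses are written down by hand, not computed.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def close(X, Y, tol=1e-12):
    return all(abs(a - b) <= tol for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

def is_pseudo_inverse(A, X):
    """Check the four Moore-Penrose conditions for the pair (A, X)."""
    AX, XA = matmul(A, X), matmul(X, A)
    return (close(matmul(AX, A), A) and close(matmul(XA, X), X)
            and close(transpose(AX), AX) and close(transpose(XA), XA))

A = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
A_pinv = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

eps = 2.0 ** -10                                   # perturbation size, exactly representable
A_pert = [[1.0, 0.0], [0.0, eps], [0.0, 0.0]]      # A + delta A
A_pert_pinv = [[1.0, 0.0, 0.0], [0.0, 1.0 / eps, 0.0]]

# Largest entrywise difference between the two pseudo-inverses equals 1/eps.
gap = max(abs(a - b) for ra, rb in zip(A_pinv, A_pert_pinv) for a, b in zip(ra, rb))
```

Halving `eps` doubles `gap`, so the pseudo-inverse moves farther away as the perturbation shrinks.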


5.5.6 QR with Column Pivoting and Basic Solutions 


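Suppose $A \in \mathbb{R}^{m\times n}$ has rank r. QR with column pivoting (Algorithm 5.4.1)
produces the factorization $A\Pi = QR$ where

$$R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}$$

(row blocks of size r and m-r, column blocks of size r and n-r).
Given this reduction, the LS problem can be readily solved. Indeed, for
any $x \in \mathbb{R}^n$ we have

$$\|Ax - b\|_2^2 = \|(Q^TA\Pi)(\Pi^Tx) - (Q^Tb)\|_2^2 = \|R_{11}y - (c - R_{12}z)\|_2^2 + \|d\|_2^2,$$

where

$$\Pi^Tx = \begin{bmatrix} y \\ z \end{bmatrix} \quad\text{and}\quad Q^Tb = \begin{bmatrix} c \\ d \end{bmatrix}$$

with $y, c \in \mathbb{R}^r$, $z \in \mathbb{R}^{n-r}$, and $d \in \mathbb{R}^{m-r}$.
Thus, if x is an LS minimizer, then we must have

$$x = \Pi\begin{bmatrix} R_{11}^{-1}(c - R_{12}z) \\ z \end{bmatrix}.$$

If z is set to zero in this expression, then we obtain the basic solution

$$x_B = \Pi\begin{bmatrix} R_{11}^{-1}c \\ 0 \end{bmatrix}.$$

Notice that $x_B$ has at most r nonzero components and so $Ax_B$ involves a
subset of A's columns.

The basic solution is not the minimal 2-norm solution unless the
submatrix $R_{12}$ is zero since

$$\|x_{LS}\|_2 = \min_{z \in \mathbb{R}^{n-r}}\left\|x_B - \Pi\begin{bmatrix} R_{11}^{-1}R_{12} \\ -I_{n-r} \end{bmatrix}z\right\|_2. \tag{5.5.4}$$

Indeed, this characterization of $\|x_{LS}\|_2$ can be used to show

$$1 \le \frac{\|x_B\|_2}{\|x_{LS}\|_2} \le \sqrt{1 + \|R_{11}^{-1}R_{12}\|_2^2}. \tag{5.5.5}$$

See Golub and Pereyra (1976) for details.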
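
A small worked case may help fix the difference between the basic and minimum norm solutions. The matrix and right-hand side below are chosen purely for illustration (they do not come from the text), and the factorization with Q = I and Π = I is one valid choice because the two columns have equal norm and the first column is already a multiple of $e_1$.

```latex
A = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix},\quad
b = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\quad
Q = \Pi = I_2,\quad R_{11} = 1,\quad R_{12} = 1,\quad c = 1,\quad d = 1.

% Basic solution: set z = 0.
x_B = \begin{bmatrix} R_{11}^{-1}c \\ 0 \end{bmatrix}
    = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.

% Minimum norm solution from (5.5.4): minimize over the scalar z.
\left\| \begin{bmatrix} 1 \\ 0 \end{bmatrix}
      - \begin{bmatrix} 1 \\ -1 \end{bmatrix} z \right\|_2^2
  = (1 - z)^2 + z^2
  \;\Longrightarrow\; z = \tfrac{1}{2},\qquad
x_{LS} = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}.

% Check of (5.5.5): the upper bound is attained.
\frac{\|x_B\|_2}{\|x_{LS}\|_2} = \frac{1}{1/\sqrt{2}} = \sqrt{2}
  = \sqrt{1 + \|R_{11}^{-1}R_{12}\|_2^2}.
```

Both vectors give the same residual norm $\|d\|_2 = 1$; they differ only in how the solution is spread over the columns of A.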


5.5.7 Numerical Rank Determination with AII = QR 


If Algorithm 5.4.1 is used to compute $x_B$, then care must be exercised in
the determination of rank(A). In order to appreciate the difficulty of this,
suppose

$$fl(H_k \cdots H_1A\Pi_1 \cdots \Pi_k) = \hat{R}^{(k)} = \begin{bmatrix} \hat{R}_{11}^{(k)} & \hat{R}_{12}^{(k)} \\ 0 & \hat{R}_{22}^{(k)} \end{bmatrix}$$

(row blocks of size k and m-k, column blocks of size k and n-k)
is the matrix computed after k steps of the algorithm have been executed
in floating point. Suppose rank(A) = k. Because of roundoff error, $\hat{R}_{22}^{(k)}$
will not be exactly zero. However, if $\hat{R}_{22}^{(k)}$ is suitably small in norm then it
is reasonable to terminate the reduction and declare A to have rank k. A
typical termination criterion might be

$$\|\hat{R}_{22}^{(k)}\|_2 \le \epsilon_1\|A\|_2 \tag{5.5.6}$$

for some small machine-dependent parameter $\epsilon_1$. In view of the roundoff
properties associated with Householder matrix computation (cf. §5.1.12),
we know that $\hat{R}^{(k)}$ is the exact R factor of a matrix $A + E_k$, where

$$\|E_k\|_2 \le \epsilon_2\|A\|_2, \qquad \epsilon_2 = O(u).$$

Using Theorem 2.5.2 we have

$$\sigma_{k+1}(A + E_k) = \sigma_{k+1}(\hat{R}^{(k)}) \le \|\hat{R}_{22}^{(k)}\|_2.$$

Since $\sigma_{k+1}(A) \le \sigma_{k+1}(A + E_k) + \|E_k\|_2$, it follows that

$$\sigma_{k+1}(A) \le (\epsilon_1 + \epsilon_2)\|A\|_2.$$

In other words, a relative perturbation of $O(\epsilon_1 + \epsilon_2)$ in A can yield a rank-k
matrix. With this termination criterion, we conclude that QR with column
pivoting "discovers" rank degeneracy if in the course of the reduction $\hat{R}_{22}^{(k)}$
is small for some k < n.

Unfortunately, this is not always the case. A matrix can be nearly rank
deficient without a single $\hat{R}_{22}^{(k)}$ being particularly small. Thus, QR with
column pivoting by itself is not entirely reliable as a method for detecting
near rank deficiency. However, if a good condition estimator is applied to
R it is practically impossible for near rank deficiency to go unnoticed.


Example 5.5.1 Let $T_n(c)$ be the matrix

$$T_n(c) = \mathrm{diag}(1, s, \ldots, s^{n-1})\begin{bmatrix} 1 & -c & -c & \cdots & -c \\ 0 & 1 & -c & \cdots & -c \\ \vdots & & \ddots & & \vdots \\ & & & 1 & -c \\ 0 & \cdots & & 0 & 1 \end{bmatrix}$$

with $c^2 + s^2 = 1$ and c, s > 0. (See Lawson and Hanson (1974, p.31).) These matrices are
unaltered by Algorithm 5.4.1 and thus $\|R_{22}^{(k)}\|_2 \ge s^{n-1}$ for k = 1:n-1. This inequality
implies (for example) that the matrix $T_{100}(.2)$ has no particularly small trailing principal
submatrix since $s^{99} \approx .13$. However, it can be shown that $\sigma_n = O(10^{-8})$.
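
The claim can be probed numerically without an SVD routine. The sketch below builds $T_n(c)$ and estimates its smallest singular value by inverse iteration on $T^TT$, using only triangular solves. The function names and the choice of five iterations are assumptions for illustration; because every iterate is a unit vector, the returned value can only overestimate the true smallest singular value (up to roundoff), so a tiny estimate is conclusive evidence of near rank deficiency.

```python
import math

def kahan(n, c):
    """Build T_n(c) = diag(1, s, ..., s^(n-1)) times a unit upper triangular matrix."""
    s = math.sqrt(1.0 - c * c)
    return [[s**i * (1.0 if j == i else -c if j > i else 0.0) for j in range(n)]
            for i in range(n)]

def solve_upper(T, y):
    """Solve T x = y by back substitution (T upper triangular)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(T[i][j] * x[j] for j in range(i + 1, n))) / T[i][i]
    return x

def solve_upper_transposed(T, y):
    """Solve T^T x = y by forward substitution."""
    n = len(y)
    x = [0.0] * n
    for i in range(n):
        x[i] = (y[i] - sum(T[j][i] * x[j] for j in range(i))) / T[i][i]
    return x

def smallest_singular_value_estimate(T, iters=5):
    """Inverse iteration on T^T T; returns an upper estimate of sigma_min(T)."""
    n = len(T)
    x = [1.0 / math.sqrt(n)] * n
    est = float("inf")
    for _ in range(iters):
        z = solve_upper(T, solve_upper_transposed(T, x))   # z = (T^T T)^{-1} x
        norm_z = math.sqrt(sum(t * t for t in z))
        est = 1.0 / math.sqrt(norm_z)
        x = [t / norm_z for t in z]
    return est

T = kahan(100, 0.2)
smallest_diagonal = min(T[i][i] for i in range(100))   # equals s^99
sigma_min_estimate = smallest_singular_value_estimate(T)
```

The diagonal of T never drops below about .13, yet the singular value estimate is many orders of magnitude smaller, which is the situation a condition estimator applied to R is meant to catch.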


5.5.8 Numerical Rank and the SVD 


We now focus our attention on the ability of the SVD to handle rank
deficiency in the presence of roundoff. Recall that if $A = U\Sigma V^T$ is the
SVD of A, then

$$x_{LS} = \sum_{i=1}^{r}\frac{u_i^Tb}{\sigma_i}v_i \tag{5.5.7}$$

where r = rank(A). Denote the computed versions of U, V, and $\Sigma =
\mathrm{diag}(\sigma_i)$ by $\hat{U}$, $\hat{V}$, and $\hat{\Sigma} = \mathrm{diag}(\hat{\sigma}_i)$. Assume that both sequences of singular
values range from largest to smallest. For a reasonably implemented SVD
algorithm it can be shown that

$$\hat{U} = W + \Delta U, \qquad W^TW = I_m, \qquad \|\Delta U\|_2 \le \epsilon \tag{5.5.8}$$

$$\hat{V} = Z + \Delta V, \qquad Z^TZ = I_n, \qquad \|\Delta V\|_2 \le \epsilon \tag{5.5.9}$$

$$\hat{\Sigma} = W^T(A + \Delta A)Z, \qquad \|\Delta A\|_2 \le \epsilon\|A\|_2 \tag{5.5.10}$$

where $\epsilon$ is a small multiple of u, the machine precision. In plain English, the
SVD algorithm computes the singular values of a "nearby" matrix $A + \Delta A$.

Note that $\hat{U}$ and $\hat{V}$ are not necessarily close to their exact counterparts.
However, we can show that $\hat{\sigma}_k$ is close to $\sigma_k$. Using (5.5.10) and Theorem
2.5.2 we have

$$\sigma_k = \min_{\mathrm{rank}(B) = k-1}\|A - B\|_2 = \min_{\mathrm{rank}(B) = k-1}\|(\hat{\Sigma} - B) - W^T(\Delta A)Z\|_2.$$

Since $\|W^T(\Delta A)Z\|_2 \le \epsilon\|A\|_2 = \epsilon\sigma_1$ and

$$\min_{\mathrm{rank}(B) = k-1}\|\hat{\Sigma} - B\|_2 = \hat{\sigma}_k$$

it follows that $|\sigma_k - \hat{\sigma}_k| \le \epsilon\sigma_1$ for k = 1:n. Thus, if A has rank r then we
can expect n - r of the computed singular values to be small. Near rank
deficiency in A cannot escape detection when the SVD of A is computed.


Example 5.5.2 For the matrix $T_{100}(.2)$ in Example 5.5.1, $\sigma_n \approx .367 \cdot 10^{-8}$.

One approach to estimating r = rank(A) from the computed singular
values is to have a tolerance $\delta > 0$ and a convention that A has "numerical
rank" $\hat{r}$ if the $\hat{\sigma}_i$ satisfy

$$\hat{\sigma}_1 \ge \cdots \ge \hat{\sigma}_{\hat{r}} > \delta \ge \hat{\sigma}_{\hat{r}+1} \ge \cdots \ge \hat{\sigma}_n.$$

The tolerance $\delta$ should be consistent with the machine precision, e.g. $\delta =
u\|A\|_\infty$. However, if the general level of relative error in the data is larger
than u, then $\delta$ should be correspondingly bigger, e.g., $\delta = 10^{-2}\|A\|_\infty$ if
the entries in A are correct to two digits.
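
The tolerance convention amounts to counting the computed singular values that exceed δ. The sketch below assumes the singular values have already been computed elsewhere; the function names are hypothetical, and the value `UNIT_ROUNDOFF = 2**-53` is an assumption appropriate to IEEE double precision.

```python
UNIT_ROUNDOFF = 2.0 ** -53   # assumed value of u for IEEE double precision

def infinity_norm(A):
    """Largest absolute row sum of a list-of-rows matrix."""
    return max(sum(abs(t) for t in row) for row in A)

def numerical_rank(sigma_hat, delta):
    """Number of computed singular values strictly greater than delta."""
    return sum(1 for s in sigma_hat if s > delta)

A = [[1.0, 1.0], [1.0, 1.0]]
sigma_hat = [2.0, 3.0e-17]            # illustrative computed singular values of A

delta_machine = UNIT_ROUNDOFF * infinity_norm(A)   # delta = u * ||A||_inf
delta_two_digits = 1.0e-2 * infinity_norm(A)       # data correct to two digits

rank_machine = numerical_rank(sigma_hat, delta_machine)
rank_two_digits = numerical_rank([2.0, 0.01], delta_two_digits)
```

Both calls report numerical rank 1: in the first case the trailing value is at roundoff level, and in the second it is below the noise level implied by two-digit data.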

If $\hat{r}$ is accepted as the numerical rank then we can regard

$$x_{\hat{r}} = \sum_{i=1}^{\hat{r}}\frac{\hat{u}_i^Tb}{\hat{\sigma}_i}\hat{v}_i$$

as an approximation to $x_{LS}$. Since $\|x_{\hat{r}}\|_2 \approx 1/\hat{\sigma}_{\hat{r}} \le 1/\delta$ then $\delta$ may also
be chosen with the intention of producing an approximate LS solution with
suitably small norm. In §12.1, we discuss more sophisticated methods for
doing this.

If $\hat{\sigma}_{\hat{r}} \gg \delta$, then we have reason to be comfortable with $x_{\hat{r}}$ because A
can then be unambiguously regarded as a rank($A_{\hat{r}}$) matrix (modulo $\delta$).

On the other hand, $\{\hat{\sigma}_1, \ldots, \hat{\sigma}_n\}$ might not clearly split into subsets
of small and large singular values, making the determination of $\hat{r}$ by this
means somewhat arbitrary. This leads to more complicated methods for
estimating rank which we now discuss in the context of the LS problem.

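For example, suppose r = n, and assume for the moment that $\Delta A = 0$
in (5.5.10). Thus $\hat{\sigma}_i = \sigma_i$ for i = 1:n. Denote the ith columns of the
matrices $\hat{U}$, W, $\hat{V}$, and Z by $u_i$, $w_i$, $v_i$, and $z_i$, respectively. Subtracting
$x_{\hat{r}}$ from $x_{LS}$ and taking norms we obtain

$$\|x_{\hat{r}} - x_{LS}\|_2 \le \sum_{i=1}^{\hat{r}}\frac{\|(w_i^Tb)z_i - (u_i^Tb)v_i\|_2}{\sigma_i} + \sqrt{\sum_{i=\hat{r}+1}^{n}\left(\frac{w_i^Tb}{\sigma_i}\right)^2}.$$

From (5.5.8) and (5.5.9) it is easy to verify that

$$\|(w_i^Tb)z_i - (u_i^Tb)v_i\|_2 \le 2(1 + \epsilon)\epsilon\|b\|_2 \tag{5.5.11}$$

and therefore

$$\|x_{\hat{r}} - x_{LS}\|_2 \le \frac{\hat{r}}{\sigma_{\hat{r}}}2(1 + \epsilon)\epsilon\|b\|_2 + \sqrt{\sum_{i=\hat{r}+1}^{n}\left(\frac{w_i^Tb}{\sigma_i}\right)^2}.$$

The parameter $\hat{r}$ can be determined as that integer which minimizes the
upper bound. Notice that the first term in the bound increases with $\hat{r}$,
while the second decreases.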

On occasions when minimizing the residual is more important than
accuracy in the solution, we can determine $\hat{r}$ on the basis of how close we
surmise $\|b - Ax_{\hat{r}}\|_2$ is to the true minimum. Paralleling the above
analysis, it can be shown that

$$\|b - Ax_{\hat{r}}\|_2 - \|b - Ax_{LS}\|_2 \le (n - \hat{r})\|b\|_2 + \epsilon\hat{r}\|b\|_2\left(1 + (1 + \epsilon)\frac{\sigma_1}{\sigma_{\hat{r}}}\right).$$

Again $\hat{r}$ could be chosen to minimize the upper bound. See Varah (1973)
for practical details and also the LAPACK manual.


5.5.9 Some Comparisons 


As we mentioned, when solving the LS problem via the SVD, only $\Sigma$ and
V have to be computed. The following table compares the efficiency of this
approach with the other algorithms that we have presented.

    LS Algorithm                      Flop Count
    Normal Equations                  mn^2 + n^3/3
    Householder Orthogonalization     2mn^2 - 2n^3/3
    Modified Gram Schmidt             2mn^2
    Givens Orthogonalization          3mn^2 - n^3
    Householder Bidiagonalization     4mn^2 - 4n^3/3
    R-Bidiagonalization               2mn^2 + 2n^3
    Golub-Reinsch SVD                 4mn^2 + 8n^3
    R-SVD                             2mn^2 + 11n^3


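Problems


P5.5.1 Show that if

$$A = \begin{bmatrix} T & S \\ 0 & 0 \end{bmatrix}$$

(row blocks of size r and m-r, column blocks of size r and n-r)
where r = rank(A) and T is nonsingular, then

$$X = \begin{bmatrix} T^{-1} & 0 \\ 0 & 0 \end{bmatrix}$$

(row blocks of size r and n-r, column blocks of size r and m-r)
satisfies AXA = A and $(AX)^T = (AX)$. In this case, we say that X is a (1,3)
pseudo-inverse of A. Show that for general A, $x_B = Xb$ where X is a (1,3) pseudo-inverse of A.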

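P5.5.2 Define $B(\lambda) \in \mathbb{R}^{n\times m}$ by $B(\lambda) = (A^TA + \lambda I)^{-1}A^T$, where $\lambda > 0$. Show

$$\|B(\lambda) - A^+\|_2 = \frac{\lambda}{\sigma_r(A)[\sigma_r(A)^2 + \lambda]}, \qquad r = \mathrm{rank}(A)$$

and therefore that $B(\lambda) \to A^+$ as $\lambda \to 0$.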
P5.5.3 Consider the rank deficient LS problem

$$\min_{y \in \mathbb{R}^r,\ z \in \mathbb{R}^{n-r}}\left\|\begin{bmatrix} R & S \\ 0 & 0 \end{bmatrix}\begin{bmatrix} y \\ z \end{bmatrix} - \begin{bmatrix} c \\ d \end{bmatrix}\right\|_2$$

where $R \in \mathbb{R}^{r\times r}$, $S \in \mathbb{R}^{r\times(n-r)}$, $y \in \mathbb{R}^r$, and $z \in \mathbb{R}^{n-r}$. Assume that R is upper
triangular and nonsingular. Show how to obtain the minimum norm solution to this problem
by computing an appropriate QR factorization without pivoting and then solving for the
appropriate y and z.

P5.5.4 Show that if $A_k \to A$ and $A_k^+ \to A^+$, then there exists an integer $k_0$ such that
rank($A_k$) is constant for all $k \ge k_0$.

P5.5.5 Show that if $A \in \mathbb{R}^{m\times n}$ has rank n, then so does A + E if we have the inequality
$\|E\|_2\|A^+\|_2 < 1$.


Notes and References for Sec. 5.5 
The pseudo-inverse literature is vast, as evidenced by the 1,775 references in

M.Z. Nashed (1976). Generalized Inverses and Applications, Academic Press, New York.

The differentiation of the pseudo-inverse is further discussed in


C.L. Lawson and R.J. Hanson (1969). “Extensions and Applications of the Householder
Algorithm for Solving Linear Least Squares Problems,” Math. Comp. 23, 787-812.

G.H. Golub and V. Pereyra (1973). “The Differentiation of Pseudo-Inverses and
Nonlinear Least Squares Problems Whose Variables Separate,” SIAM J. Num. Anal. 10,
413-32.


Survey treatments of LS perturbation theory may be found in Lawson and Hanson
(1974), Stewart and Sun (1991), Björck (1996), and


P.A. Wedin (1972). “Perturbation Theory for Pseudo-Inverses,” BIT 13, 217-32. 
G.W. Stewart (1977). “On the Perturbation of Pseudo-Inverses, Projections, and Linear 
Least Squares,” SIAM Review 19, 634-62. 


Even for full rank problems, column pivoting seems to produce more accurate solutions. 

The error analysis in the following paper attempts to explain why. 

L.S. Jennings and M.R. Osborne (1974). “A Direct Error Analysis for Least Squares,” 
Numer. Math, 22, 322-32. 

Various other aspects of rank deficiency are discussed in

J.M. Varah (1973). “On the Numerical Solution of Ill-Conditioned Linear Systems with
Applications to Ill-Posed Problems,” SIAM J. Num. Anal. 10, 257-67.

G.W. Stewart (1984). “Rank Degeneracy,” SIAM J. Sci. and Stat. Comp. 5, 403-413.

P.C. Hansen (1987). “The Truncated SVD as a Method for Regularization,” BIT 27,
534-553.

G.W. Stewart (1987). “Collinearity and Least Squares Regression,” Statistical Science
2, 68-100.


We have more to say on the subject in §12.1 and §12.2. 


5.6 Weighting and Iterative Improvement 


The concepts of scaling and iterative improvement were introduced in the 
Chapter 3 context of square linear systems. Generalizations of these ideas 
that are applicable to the least squares problem are now offered. 

5.6.1 Column Weighting 

Suppose $G \in \mathbb{R}^{n\times n}$ is nonsingular. A solution to the LS problem

$$\min\|Ax - b\|_2, \qquad A \in \mathbb{R}^{m\times n},\ b \in \mathbb{R}^m \tag{5.6.1}$$

can be obtained by finding the minimum 2-norm solution $y_{LS}$ to

$$\min\|(AG)y - b\|_2 \tag{5.6.2}$$

and then setting $x_G = Gy_{LS}$. If rank(A) = n, then $x_G = x_{LS}$. Otherwise,
$x_G$ is the minimum G-norm solution to (5.6.1), where the G-norm is defined
by $\|z\|_G = \|G^{-1}z\|_2$.

The choice of G is important. Sometimes its selection can be based on
a priori knowledge of the uncertainties in A. On other occasions, it may be
desirable to normalize the columns of A by setting

$$G = G_0 \equiv \mathrm{diag}(1/\|A(:,1)\|_2, \ldots, 1/\|A(:,n)\|_2).$$

Van der Sluis (1969) has shown that with this choice, $\kappa_2(AG)$ is
approximately minimized. Since the computed accuracy of $y_{LS}$ depends on $\kappa_2(AG)$,
a case can be made for setting $G = G_0$.

We remark that column weighting affects singular values. Consequently,
a scheme for determining numerical rank may not return the same estimates
when applied to A and AG. See Stewart (1984b).
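
Column normalization with $G_0$ is a one-line scaling per column. The sketch below forms the diagonal of $G_0$ and the scaled matrix $AG_0$, and maps a solution y of the scaled problem back to $x_G = G_0y$. The function names are hypothetical, and the code assumes A has no zero columns.

```python
import math

def column_scaling(A):
    """Return (g, AG) where G0 = diag(g) gives every column of AG unit 2-norm."""
    m, n = len(A), len(A[0])
    norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
    g = [1.0 / t for t in norms]              # assumes no column of A is zero
    AG = [[A[i][j] * g[j] for j in range(n)] for i in range(m)]
    return g, AG

def unscale_solution(g, y):
    """Map the solution y of the scaled problem back to x_G = G0 y."""
    return [gj * yj for gj, yj in zip(g, y)]

# Columns of very different size: norms 5 and 1000.
A = [[3.0, 1000.0], [4.0, 0.0]]
g, AG = column_scaling(A)
```

Only the diagonal of $G_0$ is stored, since both applying $G_0$ to A and recovering $x_G$ are elementwise operations.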


5.6.2 Row Weighting 


Let $D = \mathrm{diag}(d_1, \ldots, d_m)$ be nonsingular and consider the weighted least
squares problem

$$\text{minimize } \| D(Ax - b) \|_2, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^{m}. \tag{5.6.3}$$

Assume rank(A) = n and that $x_D$ solves (5.6.3). It follows that the solution
$x_{LS}$ to (5.6.1) satisfies

$$x_D - x_{LS} = (A^T D^2 A)^{-1} A^T (D^2 - I)(b - A x_{LS}). \tag{5.6.4}$$

This shows that row weighting in the LS problem affects the solution. (An
important exception occurs when $b \in \mathrm{ran}(A)$ for then $x_D = x_{LS}$.)

One way of determining D is to let $d_k$ be some measure of the uncertainty
in $b_k$, e.g., the reciprocal of the standard deviation in $b_k$. The
tendency is for $r_k = e_k^T(b - A x_D)$ to be small whenever $d_k$ is large. The
precise effect of $d_k$ on $r_k$ can be clarified as follows. Define

$$D(\delta) = \mathrm{diag}\left(d_1, \ldots, d_{k-1},\ d_k\sqrt{1+\delta},\ d_{k+1}, \ldots, d_m\right)$$

where $\delta > -1$. If $x(\delta)$ minimizes $\| D(\delta)(Ax - b) \|_2$ and $r_k(\delta)$ is the k-th
component of $b - Ax(\delta)$, then it can be shown that

$$r_k(\delta) = \frac{r_k}{1 + \delta\, d_k^2\, e_k^T A (A^T D^2 A)^{-1} A^T e_k}. \tag{5.6.5}$$

This explicit expression shows that $r_k(\delta)$ is a monotone decreasing function
of $\delta$. Of course, how $r_k$ changes when all the weights are varied is much
more complicated.
Example 5.6.1 Suppose

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

If $D = I_4$ then $x_D = [-1, .85]^T$ and $r = b - Ax_D = [.3, -.4, -.1, .2]^T$. On
the other hand, if D = diag(1000, 1, 1, 1) then we have $x_D \approx [-1.43, 1.21]^T$ and
$r = b - Ax_D = [.000428, -.571428, -.142853, .285714]^T$.
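The behavior described in this subsection can be checked with a few lines of exact rational arithmetic. The sketch below assumes the data A = [1 2; 3 4; 5 6; 7 8] and b = [1, 0, 0, 0]^T, and the function name `weighted_ls` is illustrative. It solves the weighted normal equations $A^T D^2 A x = A^T D^2 b$ directly, which is acceptable here only because exact fractions make conditioning irrelevant; it is not offered as a numerically sound method.

```python
from fractions import Fraction as F

def weighted_ls(A, b, d):
    """Solve min || D(Ax - b) ||_2 exactly for an m-by-2 matrix A, D = diag(d)."""
    w = [F(di) ** 2 for di in d]                       # squared row weights
    g11 = sum(wi * r[0] * r[0] for wi, r in zip(w, A))
    g12 = sum(wi * r[0] * r[1] for wi, r in zip(w, A))
    g22 = sum(wi * r[1] * r[1] for wi, r in zip(w, A))
    c1 = sum(wi * r[0] * bi for wi, r, bi in zip(w, A, b))
    c2 = sum(wi * r[1] * bi for wi, r, bi in zip(w, A, b))
    det = g11 * g22 - g12 * g12
    x = [(g22 * c1 - g12 * c2) / det, (g11 * c2 - g12 * c1) / det]
    r = [bi - (row[0] * x[0] + row[1] * x[1]) for row, bi in zip(A, b)]
    return x, r

A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # assumed example data
b = [1, 0, 0, 0]
x_ls, r_ls = weighted_ls(A, b, [1, 1, 1, 1])       # D = I
x_d, r_d = weighted_ls(A, b, [1000, 1, 1, 1])      # heavy weight on row 1
```

With D = I the sketch returns $x = [-1, 17/20]^T$ and $r = [3/10, -2/5, -1/10, 1/5]^T$. With the heavy first weight, the first residual component shrinks in magnitude while the others grow, which is the tendency noted above for large $d_k$.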


5.6.3 Generalized Least Squares 


In many estimation problems, the vector of observations b is related to x
through the equation

$$b = Ax + w \tag{5.6.6}$$

where the noise vector w has zero mean and a symmetric positive definite
variance-covariance matrix $\sigma^2 W$. Assume that W is known and that
$W = BB^T$ for some $B \in \mathbb{R}^{m \times m}$. The matrix B might be given or it might
be W's Cholesky triangle. In order that all the equations in (5.6.6) contribute
equally to the determination of x, statisticians frequently solve the
LS problem

$$\min \| B^{-1}(Ax - b) \|_2. \tag{5.6.7}$$

An obvious computational approach to this problem is to form $\tilde{A} = B^{-1}A$
and $\tilde{b} = B^{-1}b$ and then apply any of our previous techniques to minimize
$\| \tilde{A}x - \tilde{b} \|_2$. Unfortunately, x will be poorly determined by such a procedure
if B is ill-conditioned.
A much more stable way of solving (5.6.7) using orthogonal transformations
has been suggested by Paige (1979a, 1979b). It is based on the idea
that (5.6.7) is equivalent to the generalized least squares problem,

$$\min_{b = Ax + Bv} v^T v. \tag{5.6.8}$$
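The equivalence is easy to check when B is nonsingular. The following short derivation is an added sketch of that step; it uses only the symbols already defined in (5.6.7) and (5.6.8).

$$b = Ax + Bv \iff v = B^{-1}(b - Ax), \qquad\text{so}\qquad v^T v = \| B^{-1}(b - Ax) \|_2^2 = \| B^{-1}(Ax - b) \|_2^2 .$$

For each x the constraint therefore fixes v uniquely, and minimizing $v^T v$ over all feasible pairs (x, v) is the same as minimizing the objective in (5.6.7) over x.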


Notice that this problem is defined even if A and B are rank deficient. 

Although Paige's technique can be applied when this is the case, we shall 

describe it under the assumption that both these matrices have full rank. 
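The first step is to compute the QR factorization of A:

$$Q^T A = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \begin{matrix} n \\ m-n \end{matrix}, \qquad Q = [\,\underbrace{Q_1}_{n}\ \ \underbrace{Q_2}_{m-n}\,].$$

An orthogonal matrix $Z \in \mathbb{R}^{m \times m}$ is then determined so that

$$Q_2^T B Z = [\,\underbrace{0}_{n}\ \ \underbrace{S}_{m-n}\,], \qquad Z = [\,\underbrace{Z_1}_{n}\ \ \underbrace{Z_2}_{m-n}\,]$$

where S is upper triangular. With the use of these orthogonal matrices the
constraint in (5.6.8) transforms to

$$\begin{bmatrix} Q_1^T b \\ Q_2^T b \end{bmatrix} = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x + \begin{bmatrix} Q_1^T B Z_1 & Q_1^T B Z_2 \\ 0 & S \end{bmatrix} \begin{bmatrix} Z_1^T v \\ Z_2^T v \end{bmatrix}.$$

Notice that the “bottom half” of this equation determines v,

$$S u = Q_2^T b, \qquad v = Z_2 u, \tag{5.6.9}$$

while the “top half” prescribes x:

$$R_1 x = Q_1^T b - (Q_1^T B Z_1 Z_1^T + Q_1^T B Z_2 Z_2^T)v = Q_1^T b - Q_1^T B Z_2 u. \tag{5.6.10}$$

The attractiveness of this method is that all potential ill-conditioning is
concentrated in triangular systems (5.6.9) and (5.6.10). Moreover, Paige
(1979b) has shown that the above procedure is numerically stable, something
that is not true of any method that explicitly forms $B^{-1}A$.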


5.6.4 Iterative Improvement 


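A technique for refining an approximate LS solution has been analyzed by
Björck (1967, 1968). It is based on the idea that if

$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^{m} \tag{5.6.11}$$

then $\| b - Ax \|_2 = \min$. This follows because $r + Ax = b$ and $A^T r = 0$ imply
$A^T A x = A^T b$. The above augmented system is nonsingular if rank(A) =
n, which we hereafter assume.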

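By casting the LS problem in the form of a square linear system, the
iterative improvement scheme (3.5.5) can be applied:

    $r^{(0)} = 0$;  $x^{(0)} = 0$
    for $k = 0, 1, \ldots$
        $\begin{bmatrix} f^{(k)} \\ g^{(k)} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} - \begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r^{(k)} \\ x^{(k)} \end{bmatrix}$
        $\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} p^{(k)} \\ z^{(k)} \end{bmatrix} = \begin{bmatrix} f^{(k)} \\ g^{(k)} \end{bmatrix}$
        $\begin{bmatrix} r^{(k+1)} \\ x^{(k+1)} \end{bmatrix} = \begin{bmatrix} r^{(k)} \\ x^{(k)} \end{bmatrix} + \begin{bmatrix} p^{(k)} \\ z^{(k)} \end{bmatrix}$
    end

The residuals $f^{(k)}$ and $g^{(k)}$ must be computed in higher precision and an
original copy of A must be around for this purpose.

If the QR factorization of A is available, then the solution of the augmented
system is readily obtained. In particular, if A = QR and $R_1 = R(1:n, 1:n)$,
then a system of the form

$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} p \\ z \end{bmatrix} = \begin{bmatrix} f \\ g \end{bmatrix}$$

transforms to

$$\begin{bmatrix} I_n & 0 & R_1 \\ 0 & I_{m-n} & 0 \\ R_1^T & 0 & 0 \end{bmatrix} \begin{bmatrix} h \\ f_2 \\ z \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \\ g \end{bmatrix}$$

where

$$Q^T f = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \begin{matrix} n \\ m-n \end{matrix}, \qquad Q^T p = \begin{bmatrix} h \\ f_2 \end{bmatrix} \begin{matrix} n \\ m-n \end{matrix}.$$

Thus, p and z can be determined by solving the triangular systems $R_1^T h = g$
and $R_1 z = f_1 - h$ and setting $p = Q \begin{bmatrix} h \\ f_2 \end{bmatrix}$. Assuming that Q is stored in
factored form, each iteration requires $8mn - 2n^2$ flops.

The key to the iteration's success is that both the LS residual and solution
are updated, not just the solution. Björck (1968) shows that if
$\kappa_2(A) \approx \beta^q$ and t-digit, $\beta$-base arithmetic is used, then $x^{(k)}$ has approximately
$k(t - q)$ correct base $\beta$ digits, provided the residuals are computed
in double precision. Notice that it is $\kappa_2(A)$, not $\kappa_2(A)^2$, that appears in
this heuristic.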
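The refinement loop can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the QR-based implementation described above. The augmented system (5.6.11) is solved by a plain Gaussian elimination routine (`solve`, a hypothetical helper), and exact rational arithmetic stands in for the "higher precision" residual computation. The names `solve` and `refine_ls` are illustrative.

```python
from fractions import Fraction

def solve(M, rhs):
    """Gaussian elimination with partial pivoting in ordinary floats."""
    n = len(M)
    T = [list(map(float, row)) + [float(v)] for row, v in zip(M, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(T[i][c]))
        T[c], T[piv] = T[piv], T[c]
        for i in range(c + 1, n):
            f = T[i][c] / T[c][c]
            for j in range(c, n + 1):
                T[i][j] -= f * T[c][j]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(T[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (T[i][n] - s) / T[i][i]
    return sol

def refine_ls(A, b, steps=3):
    """Iterative improvement on the augmented system [[I, A], [A^T, 0]]."""
    m, n = len(A), len(A[0])
    M = [[float(i == j) for j in range(m)] + list(A[i]) for i in range(m)]
    M += [[A[i][j] for i in range(m)] + [0.0] * n for j in range(n)]
    rhs = list(b) + [0] * n
    u = [0.0] * (m + n)                       # u = [r; x], started at zero
    for _ in range(steps):
        # Residual [f; g] = rhs - M u, formed exactly (the "higher precision" step).
        res = [Fraction(rhs[i]) - sum(Fraction(M[i][j]) * Fraction(u[j])
                                      for j in range(m + n))
               for i in range(m + n)]
        corr = solve(M, res)                  # correction [p; z] in working precision
        u = [ui + ci for ui, ci in zip(u, corr)]
    return u[:m], u[m:]

r, x = refine_ls([[1, 2], [3, 4], [5, 6], [7, 8]], [1, 0, 0, 0])
```

Both the residual r and the solution x are carried through the loop, which is the point stressed above. In practice the solve step would reuse the QR factorization through the two triangular systems $R_1^T h = g$ and $R_1 z = f_1 - h$.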

Problems 


P5.6.1 Verify (5.6.4). 
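P5.6.2 Let $A \in \mathbb{R}^{m \times n}$ have full rank and define the diagonal matrix

$$\Delta = \mathrm{diag}(\underbrace{1, \ldots, 1}_{k-1},\ (1+\delta),\ \underbrace{1, \ldots, 1}_{m-k})$$

for $\delta > -1$. Denote the LS solution to $\min \| \Delta(Ax - b) \|_2$ by $x(\delta)$ and its residual by
$r(\delta) = b - Ax(\delta)$. (a) Show

$$r(\delta) = \left( I - \delta\, \frac{A(A^TA)^{-1}A^T e_k e_k^T}{1 + \delta\, e_k^T A(A^TA)^{-1}A^T e_k} \right) r(0).$$

(b) Letting $r_k(\delta)$ stand for the kth component of $r(\delta)$, show

$$r_k(\delta) = \frac{r_k(0)}{1 + \delta\, e_k^T A(A^TA)^{-1}A^T e_k}.$$

(c) Use (b) to verify (5.6.5).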


P5.6.3 Show how the SVD can be used to solve the generalized LS problem when the
matrices A and B in (5.6.8) are rank deficient.

P5.6.4 Let $A \in \mathbb{R}^{m \times n}$ have rank n and for $\alpha > 0$ define

$$M(\alpha) = \begin{bmatrix} \alpha I_m & A \\ A^T & 0 \end{bmatrix}.$$

Show that

$$\sigma_{m+n}(M(\alpha)) = \min\left\{ \alpha,\ -\frac{\alpha}{2} + \sqrt{\sigma_n(A)^2 + \left(\frac{\alpha}{2}\right)^2} \right\}$$

and determine the value of $\alpha$ that minimizes $\kappa_2(M(\alpha))$.
P5.6.5 Another iterative improvement method for LS problems is the following:

    $x^{(0)} = 0$
    for $k = 0, 1, \ldots$
        $r^{(k)} = b - Ax^{(k)}$   (double precision)
        $\| Az^{(k)} - r^{(k)} \|_2 = \min$
        $x^{(k+1)} = x^{(k)} + z^{(k)}$
    end

(a) Assuming that the QR factorization of A is available, how many flops per iteration
are required? (b) Show that the above iteration results by setting $g^{(k)} = 0$ in the iterative
improvement scheme given in §5.6.4.


Notes and References for Sec. 5.6 


Row and column weighting in the LS problem is discussed in Lawson and Hanson (SLS,
pp. 180-88). The various effects of scaling are discussed in


A. van der Sluis (1969). “Condition Numbers and Equilibration of Matrices,” Numer. 
Math. 14, 14-23. 

G.W. Stewart (1984b). “On the Asymptotic Behavior of Scaled Singular Value and QR 
Decompositions,” Math. Comp. 43, 483—490. 


The theoretical and computational aspects of the generalized least squares problem ap- 

pear in 

S. Kourouklis and C.C. Paige (1981). “A Constrained Least Squares Approach to the
General Gauss-Markov Linear Model,” J. Amer. Stat. Assoc. 76, 820-25.

C.C. Paige (1979a). “Computer Solution and Perturbation Analysis of Generalized Least
Squares Problems,” Math. Comp. 33, 171-84.

C.C. Paige (1979b). “Fast Numerically Stable Computations for Generalized Linear
Least Squares Problems,” SIAM J. Num. Anal. 16, 165-71.

C.C. Paige (1985). “The General Linear Model and the Generalized Singular Value
Decomposition,” Lin. Alg. and Its Applic. 70, 269-284.


Iterative improvement in the least squares context is discussed in 


G.H. Golub and J.H. Wilkinson (1966). “Note on Iterative Refinement of Least Squares
Solutions,” Numer. Math. 9, 139-48.

Å. Björck and G.H. Golub (1967). “Iterative Refinement of Linear Least Squares Solutions
by Householder Transformation,” BIT 7, 322-37.

Å. Björck (1967). “Iterative Refinement of Linear Least Squares Solutions I,” BIT 7,
257-78.

Å. Björck (1968). “Iterative Refinement of Linear Least Squares Solutions II,” BIT 8,
8-30.


Å. Björck (1987). “Stability Analysis of the Method of Seminormal Equations for Linear
Least Squares Problems,” Linear Alg. and Its Applic. 88/89, 31-48.


5.7 Square and Underdetermined Systems


The orthogonalization methods developed in this chapter can be applied to
square systems and also to systems in which there are fewer equations than
unknowns. In this brief section we discuss some of the various possibilities.


5.7.1 Using QR and SVD to Solve Square Systems 


The least squares solvers based on the QR factorization and the SVD can
be used to solve square linear systems: just set m = n. However, from
the flop point of view, Gaussian elimination is the cheapest way to solve
a square linear system as shown in the following table which assumes that
the right hand side is available at the time of factorization:

    Gaussian Elimination
    Householder Orthogonalization
    Modified Gram-Schmidt
    Bidiagonalization
    Singular Value Decomposition

    TABLE 5.7.1

Nevertheless, there are three reasons why orthogonalization methods might
be considered:

* The flop counts tend to exaggerate the Gaussian elimination advantage.
  When memory traffic and vectorization overheads are considered,
  the QR approach is comparable in efficiency.

* The orthogonalization methods have guaranteed stability; there is no
  “growth factor” to worry about as in Gaussian elimination.

* In cases of ill-conditioning, the orthogonal methods give an added
  measure of reliability. QR with condition estimation is very dependable
  and, of course, SVD is unsurpassed when it comes to producing
  a meaningful solution to a nearly singular system.

We are not expressing a strong preference for orthogonalization methods
but merely suggesting viable alternatives to Gaussian elimination.

We also mention that the SVD entry in Table 5.7.1 assumes the availability
of b at the time of decomposition. Otherwise, $20n^3$ flops are required
because it then becomes necessary to accumulate the U matrix.

If the QR factorization is used to solve Ax = b, then we ordinarily
have to carry out a back substitution: $Rx = Q^T b$. However, this can be
avoided by “preprocessing” b. Suppose H is a Householder matrix such
that $Hb = \beta e_n$, where $e_n$ is the last column of $I_n$. If we compute the QR
factorization of $(HA)^T$, then $A = H^T R^T Q^T$ and the system transforms to

$$R^T y = \beta e_n$$

where $y = Q^T x$. Since $R^T$ is lower triangular, $y = (\beta / r_{nn}) e_n$ and so
$x = (\beta / r_{nn})\, Q(:, n)$.


5.7.2 Underdetermined Systems 
We say that a linear system

$$Ax = b, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^{m} \tag{5.7.1}$$

is underdetermined whenever m < n. Notice that such a system either has
no solution or has an infinity of solutions. In the second case, it is important
to distinguish between algorithms that find the minimum 2-norm solution
and those that do not necessarily do so. The first algorithm we present is
in the latter category. Assume that A has full row rank and that we apply
QR with column pivoting to obtain:

$$Q^T A \Pi = [\, R_1 \ \ R_2 \,]$$

where $R_1 \in \mathbb{R}^{m \times m}$ is upper triangular and $R_2 \in \mathbb{R}^{m \times (n-m)}$. Thus, Ax = b
transforms to

$$(Q^T A \Pi)(\Pi^T x) = [\, R_1 \ \ R_2 \,] \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = Q^T b$$

where

$$\Pi^T x = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$$

with $z_1 \in \mathbb{R}^{m}$ and $z_2 \in \mathbb{R}^{n-m}$. By virtue of the column pivoting, $R_1$ is
nonsingular because we are assuming that A has full row rank. One solution
to the problem is therefore obtained by setting $z_1 = R_1^{-1} Q^T b$ and $z_2 = 0$.

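Algorithm 5.7.1 Given $A \in \mathbb{R}^{m \times n}$ with rank(A) = m and $b \in \mathbb{R}^{m}$, the
following algorithm finds an $x \in \mathbb{R}^{n}$ such that Ax = b.

    $Q^T A \Pi = R$   (QR with column pivoting.)
    Solve $R(1:m, 1:m)\, z_1 = Q^T b$.
    Set $x = \Pi \begin{bmatrix} z_1 \\ 0 \end{bmatrix}$.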




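This algorithm requires $2m^2n - m^3/3$ flops. The minimum norm solution
is not guaranteed. (A different $\Pi$ would render a smaller $z_1$.) However, if
we compute the QR factorization

$$A^T = QR = Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

with $R_1 \in \mathbb{R}^{m \times m}$, then Ax = b becomes

$$(QR)^T x = [\, R_1^T \ \ 0 \,] \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = b$$

where

$$Q^T x = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}, \qquad z_1 \in \mathbb{R}^{m},\ z_2 \in \mathbb{R}^{n-m}.$$

Now the minimum norm solution does follow by setting $z_2 = 0$.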


Algorithm 5.7.2 Given $A \in \mathbb{R}^{m \times n}$ with rank(A) = m and $b \in \mathbb{R}^{m}$, the
following algorithm finds the minimal 2-norm solution to Ax = b.

    $A^T = QR$   (QR factorization)
    Solve $R(1:m, 1:m)^T z = b$.
    $x = Q(:, 1:m)\, z$

This algorithm requires at most $2m^2n - 2m^3/3$ flops.
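The three steps of Algorithm 5.7.2 can be sketched as follows. This is a minimal illustration that swaps in modified Gram-Schmidt for the Householder QR factorization the text has in mind, purely to keep the example short and free of library dependencies; the function name `min_norm_solve` is illustrative, and A is assumed to have full row rank.

```python
import math

def min_norm_solve(A, b):
    """Minimum 2-norm solution of Ax = b for an m-by-n matrix A with rank m."""
    m, n = len(A), len(A[0])
    cols = [list(map(float, row)) for row in A]   # columns of A^T are the rows of A
    Q = []                                        # Q[k] is the k-th column of Q
    R = [[0.0] * m for _ in range(m)]
    for k in range(m):                            # A^T = QR by modified Gram-Schmidt
        v = cols[k][:]
        for j in range(k):
            R[j][k] = sum(Q[j][i] * v[i] for i in range(n))
            v = [v[i] - R[j][k] * Q[j][i] for i in range(n)]
        R[k][k] = math.sqrt(sum(t * t for t in v))
        Q.append([t / R[k][k] for t in v])
    z = [0.0] * m                                 # forward substitution: R^T z = b
    for i in range(m):
        z[i] = (b[i] - sum(R[j][i] * z[j] for j in range(i))) / R[i][i]
    return [sum(Q[k][i] * z[k] for k in range(m)) for i in range(n)]   # x = Q z

x = min_norm_solve([[1, 0, 1], [0, 1, 1]], [1, 2])
```

For this small system the result is x = [0, 1, 1]^T up to rounding, which agrees with $A^T(AA^T)^{-1}b$. Because x is a combination of the columns of Q(:, 1:m), it lies in the row space of A, and that is what makes it the minimum norm solution.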
The SVD can also be used to compute the minimal norm solution of an
underdetermined Ax = b problem. If

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T, \qquad r = \mathrm{rank}(A)$$

is A's singular value expansion, then

$$x = \sum_{i=1}^{r} \frac{u_i^T b}{\sigma_i}\, v_i .$$

As in the least squares problem, the SVD approach is desirable whenever
A is nearly rank deficient.


5.7.3 Perturbed Underdetermined Systems 


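We conclude this section with a perturbation result for full-rank underdetermined
systems.

Theorem 5.7.1 Suppose rank(A) = m ≤ n and that $A \in \mathbb{R}^{m \times n}$, $\delta A \in \mathbb{R}^{m \times n}$,
$0 \ne b \in \mathbb{R}^{m}$, and $\delta b \in \mathbb{R}^{m}$ satisfy

$$\epsilon = \max\{\epsilon_A, \epsilon_b\} < \sigma_m(A),$$

where $\epsilon_A = \| \delta A \|_2 / \| A \|_2$ and $\epsilon_b = \| \delta b \|_2 / \| b \|_2$. If $x$ and $\hat{x}$ are minimum
norm solutions that satisfy

$$Ax = b, \qquad (A + \delta A)\hat{x} = b + \delta b$$

then

$$\frac{\| \hat{x} - x \|_2}{\| x \|_2} \le \kappa_2(A)\left( \epsilon_A \min\{2,\ n-m+1\} + \epsilon_b \right) + O(\epsilon^2).$$

Proof. Let E and f be defined by $\delta A/\epsilon$ and $\delta b/\epsilon$. Note that rank(A + tE) =
m for all $0 < t < \epsilon$ and that

$$x(t) = (A + tE)^T \left( (A + tE)(A + tE)^T \right)^{-1} (b + tf)$$

satisfies $(A + tE)x(t) = b + tf$. By differentiating this expression with
respect to t and setting t = 0 in the result we obtain

$$\dot{x}(0) = \left(I - A^T(AA^T)^{-1}A\right) E^T (AA^T)^{-1} b + A^T(AA^T)^{-1}(f - Ex).$$

Since

$$\| x \|_2 = \| A^T(AA^T)^{-1} b \|_2 \ge \sigma_m(A)\, \| (AA^T)^{-1} b \|_2,$$

$$\| I - A^T(AA^T)^{-1}A \|_2 = \min(1,\ n-m),$$

and

$$\frac{\| f \|_2}{\| x \|_2} \le \frac{\| f \|_2 \| A \|_2}{\| b \|_2},$$

we have

$$\frac{\| \hat{x} - x \|_2}{\| x \|_2} = \frac{\| x(\epsilon) - x(0) \|_2}{\| x(0) \|_2} = \epsilon\, \frac{\| \dot{x}(0) \|_2}{\| x \|_2} + O(\epsilon^2)$$

$$\le \epsilon \left\{ \min(1,\ n-m)\, \frac{\| E \|_2}{\| A \|_2} + \frac{\| f \|_2}{\| b \|_2} + \frac{\| E \|_2}{\| A \|_2} \right\} \kappa_2(A) + O(\epsilon^2)$$

from which the theorem follows. □

Note that there is no $\kappa_2(A)^2$ factor as in the case of overdetermined systems.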


Problems 


P5.7.1 Derive the above expression for $\dot{x}(0)$.

P5.7.2 Find the minimal norm solution to the system Ax = b where A = [1 2 3] and
b = 1.

P5.7.3 Show how triangular system solving can be avoided when using the QR factorization
to solve an underdetermined system.

P5.7.4 Suppose $b, x \in \mathbb{R}^{n}$ are given. Consider the following problems:

(a) Find an unsymmetric Toeplitz matrix T so Tx = b.

(b) Find a symmetric Toeplitz matrix T so Tx = b.

(c) Find a circulant matrix C so Cx = b.

Pose each problem in the form Ap = b where A is a matrix made up of entries from x
and p is the vector of sought-after parameters.


Notes and References for Sec. 5.7 
Interesting aspects concerning singular systems are discussed in 


T.F. Chan (1984). “Deflated Decomposition Solutions of Nearly Singular Systems,”
SIAM J. Num. Anal. 21, 738-754.

G.H. Golub and C.D. Meyer (1986). “Using the QR Factorization and Group Inversion 
to Compute, Differentiate, and estimate the Sensitivity of Stationary Probabilities 
for Markov Chains" SIAM J. Alg. and Dis. Methods, 7, 273-281. 


Papers on underdetermined systems include 


R.E. Cline and R.J. Plemmons (1976). “$\ell_2$-Solutions to Underdetermined Linear Systems,”
SIAM Review 18, 92-106.

M. Arioli and A. Laratta (1985). “Error Analysis of an Algorithm for Solving an Under- 
determined System,” Numer. Math. 46, 255-268. 

J.W. Demmel and N.J. Higham (1993). “Improved Error Bounds for Underdetermined
System Solvers,” SIAM J. Matrix Anal. Appl. 14, 1-14.


The QR factorization can of course be used to solve linear systems. See


N.J. Higham (1991). “Iterative Refinement Enhances the Stability of QR Factorization
Methods for Solving Linear Equations,” BIT 31, 447-468.


Chapter 6 


Parallel Matrix 
Computations 


§6.1 Basic Concepts
§6.2 Matrix Multiplication
§6.3 Factorizations


The parallel matrix computation area has been the focus of intense 
research. Although much of the work is machine/system dependent, a 
number of basic strategies have emerged. Our aim is to present these along 
with a picture of what it is like to “think parallel" during the design of a 
matrix computation. 

The distributed and shared memory paradigms are considered. We use
matrix-vector multiplication to introduce the notion of a node program in
§6.1. Load balancing, speed-up, and synchronization are also discussed.
In §6.2 matrix-matrix multiplication is used to show the effect of blocking
on granularity and to convey the spirit of two-dimensional data flow. Two
parallel implementations of the Cholesky factorization are given in §6.3.


Before You Begin 


Chapter 1, §4.1, and §4.2 are assumed. Within this chapter there are
the following dependencies:


§6.1 → §6.2 → §6.3


Complementary references include the books by Schönauer (1987), Hockney
and Jesshope (1988), Modi (1988), Ortega (1988), Dongarra, Duff,
Sorensen, and van der Vorst (1991), and Golub and Ortega (1993) and the
excellent review papers by Heller (1978), Ortega and Voigt (1985), Gallivan,
Plemmons, and Sameh (1990), and Demmel, Heath, and van der Vorst
(1993).


6.1 Basic Concepts 


In this section we introduce the distributed and shared memory paradigms
using the gaxpy operation

$$z = y + Ax, \qquad A \in \mathbb{R}^{n \times n},\ x, y, z \in \mathbb{R}^{n} \tag{6.1.1}$$

as an example. In practice, there is a fuzzy line between these two styles
of parallel computing and typically a blend of our comments applies to any
particular machine.


6.1.1 Distributed Memory Systems 


In a distributed memory multiprocessor each processor has a local memory
and executes its own node program. The program can alter values in
the executing processor's local memory and can send data in the form of
messages to the other processors in the network. The interconnection of
the processors defines the network topology and one simple example that
is good enough for our introduction is the ring. See FIGURE 6.1.1.


FIGURE 6.1.1 A Four-Processor Ring


Other important interconnection schemes include the mesh and torus (for their
close correspondence with two-dimensional arrays), the hypercube (for its
generality and optimality), and the tree (for its handling of divide and
conquer procedures). See Ortega and Voigt (1985) for a discussion of the
possibilities. Our immediate goal is to develop a ring algorithm for (6.1.1).
Matrix multiplication on a torus is discussed in §6.2.

Each processor has an identification number. The $\mu$th processor is designated
by Proc($\mu$). We say that Proc($\lambda$) is a neighbor of Proc($\mu$) if there
is a direct physical connection between them. Thus, in a p-processor ring,
Proc(p − 1) and Proc(1) are neighbors of Proc(p).




Important factors in the design of an effective distributed memory al- 
gorithm include (a) the number of processors and the capacity of the local 
memories, (b) how the processors are interconnected, (c) the speed of com- 
putation relative to the speed of interprocessor communication, and (d) 
whether or not a node is able to compute and communicate at the same 
time. 


6.1.2 Communication 


To describe the sending and receiving of messages we adopt a simple notation:

    send( {matrix} , {id of the receiving processor} )
    recv( {matrix} , {id of the sending processor} )

Scalars and vectors are matrices and therefore messages. In our model,
if Proc($\mu$) executes the instruction send($V_{loc}$, $\lambda$), then a copy of the local
matrix $V_{loc}$ is sent to Proc($\lambda$) and the execution of Proc($\mu$)'s node program
resumes immediately. It is legal for a processor to send a message to itself.
To emphasize that a matrix is stored in a local memory we use the subscript
“loc.”

If Proc($\mu$) executes the instruction recv($U_{loc}$, $\lambda$), then the execution of
its node program is suspended until a message is received from Proc($\lambda$).
Once received, the message is placed in a local matrix $U_{loc}$, and Proc($\mu$)
resumes execution of its node program.

Although the syntax and semantics of our send/receive notation are adequate
for our purposes, they do suppress a number of important details:

* Message assembly overhead. In practice, there may be a penalty
  associated with the transmission of a matrix whose entries are not
  contiguous in the sender’s local memory. We ignore this detail.

* Message tagging. Messages need not arrive in the order they are sent,
  and a system of message tagging is necessary so that the receiver is
  not “confused.” We ignore this detail by assuming that messages do
  arrive in the order that they are sent.

* Message interpretation overhead. In practice a message is a bit string,
  and a header must be provided that indicates to the receiver the
  dimensions of the matrix and the format of the floating point words
  that are used to represent its entries. Going from message to stored
  matrix takes time, but it is an overhead that we do not try to quantify.

These simplifications enable us to focus on high-level algorithmic ideas. But
it should be remembered that the success of a particular implementation
may hinge upon the control of these hidden overheads.


6.1.3 Some Distributed Data Structures


Before we can specify our first distributed memory algorithm, we must 
consider the matter of data layout. How are the participating matrices and 
vectors distributed around the network? 

Suppose $x \in \mathbb{R}^{n}$ is to be distributed among the local memories of a
p-processor network. Assume for the moment that n = rp. Two “canonical”
approaches to this problem are store-by-row and store-by-column.

In store-by-column we regard the vector x as an r-by-p matrix,

$$x_{r \times p} = [\, x(1:r) \ \ x(r+1:2r) \ \ \cdots \ \ x(1+(p-1)r:n) \,],$$

and store each column in a processor, i.e., $x(1+(\mu-1)r:\mu r) \in$ Proc($\mu$).
(In this context “$\in$” means “is stored in.”) Note that each processor houses
a contiguous portion of x.

In the store-by-row scheme we regard x as a p-by-r matrix

$$x_{p \times r} = [\, x(1:p) \ \ x(p+1:2p) \ \ \cdots \ \ x((r-1)p+1:n) \,],$$

and store each row in a processor, i.e., $x(\mu:p:n) \in$ Proc($\mu$). Store-by-row is
sometimes referred to as the wrap method of distributing a vector because
the components of x can be thought of as cards in a deck that are “dealt”
to the processors in wrap-around fashion.

If n is not an exact multiple of p, then these ideas go through with minor
modification. Consider store-by-column with n = 14 and p = 4:

$$x^T = [\, \underbrace{x_1\ x_2\ x_3\ x_4}_{\text{Proc}(1)} \mid \underbrace{x_5\ x_6\ x_7\ x_8}_{\text{Proc}(2)} \mid \underbrace{x_9\ x_{10}\ x_{11}}_{\text{Proc}(3)} \mid \underbrace{x_{12}\ x_{13}\ x_{14}}_{\text{Proc}(4)} \,].$$

In general, if n = pr + q with 0 ≤ q < p, then Proc(1),...,Proc(q) can
each house r + 1 components and Proc(q + 1),...,Proc(p) can house r
components. In store-by-row we simply let Proc($\mu$) house $x(\mu:p:n)$.
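The two layouts amount to two index maps, which the following sketch spells out. The function names `store_by_column` and `store_by_row` are illustrative; indices are 1-based to match the text.

```python
def store_by_column(n, p):
    """Contiguous layout: Proc(1..q) get r+1 indices, the rest get r (n = p*r + q)."""
    r, q = divmod(n, p)
    layout, start = [], 1
    for mu in range(1, p + 1):
        size = r + 1 if mu <= q else r
        layout.append(list(range(start, start + size)))
        start += size
    return layout

def store_by_row(n, p):
    """Wrap layout: Proc(mu) houses x(mu:p:n)."""
    return [list(range(mu, n + 1, p)) for mu in range(1, p + 1)]

contiguous = store_by_column(14, 4)
wrapped = store_by_row(14, 4)
```

For n = 14 and p = 4 the contiguous map reproduces the partition displayed above, while the wrap map gives Proc(1) the indices 1, 5, 9, 13 and Proc(4) the indices 4, 8, 12.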

Similar options apply to the layout of a matrix. There are four obvious
possibilities if $A \in \mathbb{R}^{n \times n}$ and (for simplicity) n = rp:

    | Orientation | Style      | What is in Proc($\mu$)      |
    | Column      | Contiguous | $A(:, 1+(\mu-1)r:\mu r)$    |
    | Column      | Wrap       | $A(:, \mu:p:n)$             |
    | Row         | Contiguous | $A(1+(\mu-1)r:\mu r, :)$    |
    | Row         | Wrap       | $A(\mu:p:n, :)$             |

These strategies have block analogs. For example, if $A = [A_1, \ldots, A_N]$ is
a block column partitioning, then we could arrange to have Proc($\mu$) store
$A_i$ for $i = \mu:p:N$.


6.1.4 Gaxpy on a Ring


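We are now set to develop a ring algorithm for the gaxpy z = y + Ax
($A \in \mathbb{R}^{n \times n}$, $x, y \in \mathbb{R}^{n}$). For clarity, assume that n = rp where p is the size
of the ring. Partition the gaxpy as

$$\begin{bmatrix} z_1 \\ \vdots \\ z_p \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} + \begin{bmatrix} A_{11} & \cdots & A_{1p} \\ \vdots & \ddots & \vdots \\ A_{p1} & \cdots & A_{pp} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix} \tag{6.1.2}$$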


where $A_{ij} \in \mathbb{R}^{r \times r}$ and $x_i, y_i, z_i \in \mathbb{R}^{r}$. We assume that at the start of computation
Proc($\mu$) houses $x_\mu$, $y_\mu$, and the $\mu$th block row of A. Upon completion
we set as our goal the overwriting of $y_\mu$ by $z_\mu$. From the Proc($\mu$)
perspective, the computation of

$$z_\mu = y_\mu + \sum_{\tau=1}^{p} A_{\mu\tau} x_\tau$$

involves local data ($A_{\mu\tau}$, $y_\mu$, $x_\mu$) and nonlocal data ($x_\tau$, $\tau \ne \mu$). To make
the nonlocal portions of x available, we circulate its subvectors around the
ring. For example, in the p = 3 case we rotate the $x_1$, $x_2$, and $x_3$ as follows:

    | step | Proc(1) | Proc(2) | Proc(3) |
    |  1   | $x_3$   | $x_1$   | $x_2$   |
    |  2   | $x_2$   | $x_3$   | $x_1$   |
    |  3   | $x_1$   | $x_2$   | $x_3$   |

When a subvector of x “visits”, the host processor must incorporate the
appropriate term into its running sum:

    | step | Proc(1)                    | Proc(2)                    | Proc(3)                    |
    |  1   | $y_1 = y_1 + A_{13}x_3$    | $y_2 = y_2 + A_{21}x_1$    | $y_3 = y_3 + A_{32}x_2$    |
    |  2   | $y_1 = y_1 + A_{12}x_2$    | $y_2 = y_2 + A_{23}x_3$    | $y_3 = y_3 + A_{31}x_1$    |
    |  3   | $y_1 = y_1 + A_{11}x_1$    | $y_2 = y_2 + A_{22}x_2$    | $y_3 = y_3 + A_{33}x_3$    |

In general, the “merry-go-round” of x subvectors makes p “stops.” For each
received x-subvector, a processor performs an r-by-r gaxpy.


Algorithm 6.1.1 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^{n}$, and $y \in \mathbb{R}^{n}$ are given and
that z = y + Ax. If each processor in a p-processor ring executes the
following node program and n = rp, then upon completion Proc($\mu$) houses
$z(1+(\mu-1)r:\mu r)$ in $y_{loc}$. Assume the following local memory initializations:
p, $\mu$ (the node id), left and right (the neighbor id's), n, $row = 1+(\mu-1)r:\mu r$,
$A_{loc} = A(row, :)$, $x_{loc} = x(row)$, $y_{loc} = y(row)$.

    for $t = 1:p$
        send($x_{loc}$, right)
        recv($x_{loc}$, left)
        $\tau = \mu - t$
        if $\tau \le 0$
            $\tau = \tau + p$
        end
        { $x_{loc} = x(1+(\tau-1)r:\tau r)$ }
        $y_{loc} = y_{loc} + A_{loc}(:, 1+(\tau-1)r:\tau r)\, x_{loc}$
    end

The index $\tau$ names the currently available x subvector. Once it is computed
it is possible to carry out the update of the locally housed portion of
y. The send-recv pair passes the currently housed x subvector to the right
and waits to receive the next one from the left. Synchronization is achieved
because the local y update cannot begin until the “new” x subvector arrives.
It is impossible for one processor to “race ahead” of the others or for
an x subvector to pass another in the merry-go-round. The algorithm is
tailored to the ring topology in that only nearest neighbor communication
is involved. The computation is also perfectly load balanced, meaning that
each processor has the same amount of computation and communication.
Load imbalance is discussed further in §6.1.7.

The design of a parallel program involves subtleties that do not arise in
the uniprocessor setting. For example, if we inadvertently reverse the order
of the send and the recv, then each processor starts its node program by
waiting for a message from its left neighbor. Since that neighbor in turn is
waiting for a message from its left neighbor, a state of deadlock results.
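The node program of Algorithm 6.1.1 can be tried out on a single machine by giving each Proc($\mu$) its own thread. The sketch below is an illustration under stated assumptions rather than a real distributed implementation: a `queue.Queue` per node plays the role of the network, `put` models the non-blocking send, `get` models the blocking recv, and the function name `ring_gaxpy` is illustrative. It assumes n = rp.

```python
import queue
import threading

def ring_gaxpy(A, x, y, p):
    """Simulate Algorithm 6.1.1 with one thread per Proc(mu), mu = 1..p."""
    n = len(x)
    r = n // p                                   # assumes n = r*p
    inbox = [queue.Queue() for _ in range(p)]    # inbox[mu-1]: messages for Proc(mu)
    z_blocks = [None] * p

    def node(mu):
        right = mu % p + 1                       # right neighbor on the ring
        rows = range((mu - 1) * r, mu * r)
        A_loc = [A[i] for i in rows]
        x_loc = [x[i] for i in rows]
        y_loc = [y[i] for i in rows]
        for t in range(1, p + 1):
            inbox[right - 1].put(x_loc)          # send(x_loc, right): returns at once
            x_loc = inbox[mu - 1].get()          # recv(x_loc, left): blocks until arrival
            tau = mu - t
            if tau <= 0:
                tau += p
            off = (tau - 1) * r                  # x_loc now holds x(1+(tau-1)r : tau*r)
            for a in range(r):
                y_loc[a] += sum(A_loc[a][off + c] * x_loc[c] for c in range(r))
        z_blocks[mu - 1] = y_loc

    threads = [threading.Thread(target=node, args=(mu,)) for mu in range(1, p + 1)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [v for block in z_blocks for v in block]
```

Only the left neighbor ever writes to a node's inbox, so the first-in, first-out queue preserves the message order that the text assumes. Swapping the `put` and `get` lines reproduces the deadlock described above: every thread would block on an empty queue.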


6.1.5 The Cost of Communication 


Communication overheads can be estimated if we model the cost of sending
and receiving a message. To that end we assume that a send or recv
involving m floating point numbers requires

$$\tau(m) = \alpha_d + \beta_d m \tag{6.1.3}$$

seconds to carry out. Here $\alpha_d$ is the time required to initiate the send or
recv and $\beta_d$ is the reciprocal of the rate that a message can be transferred.
Note that this model does not take into consideration the “distance” be- 
tween the sender and receiver. Clearly, it takes longer to pass a message 
halfway around a ring than to a neighbor. That is why it is always desirable 
to arrange (if possible) a distributed computation so that communication 
is just between neighbors. 

During each step in Algorithm 6.1.1 an r-vector is sent and received and
$2r^2$ flops are performed. If the computation proceeds at R flops per second
and there is no idle waiting associated with the recv, then each $y_{loc}$ update
requires approximately $(2r^2/R) + 2(\alpha_d + \beta_d r)$ seconds.

Another instructive statistic is the computation-to-communication ratio.
For Algorithm 6.1.1 this is prescribed by

$$\frac{\text{Time spent computing}}{\text{Time spent communicating}} \approx \frac{2r^2/R}{2(\alpha_d + \beta_d r)}.$$

This fraction quantifies the overhead of communication relative to the volume
of computation. Clearly, as r = n/p grows, the fraction of time spent
computing increases.¹


6.1.6 Efficiency and Speed-Up

The efficiency of a p-processor parallel algorithm is given by

$$E = \frac{T(1)}{p\,T(p)}$$

where T(k) is the time required to execute the program on k processors.
If computation proceeds at R flops/sec and communication is modeled by
(6.1.3), then a reasonable estimate of T(k) for Algorithm 6.1.1 is given by

$$T(k) = \sum_{t=1}^{k} \left[ \frac{2(n/k)^2}{R} + 2\left(\alpha_d + \beta_d (n/k)\right) \right] = \frac{2n^2}{Rk} + 2\alpha_d k + 2\beta_d n$$

for k > 1. This assumes no idle waiting. If k = 1, then no communication
is required and $T(1) = 2n^2/R$. It follows that the efficiency

$$E = \frac{1}{1 + \dfrac{pR}{n}\left( \alpha_d \dfrac{p}{n} + \beta_d \right)}$$

improves with increasing n and degrades with increasing p or R. In
practice, benchmarking is the only dependable way to assess efficiency.
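The model is simple enough to evaluate directly. The following sketch codes (6.1.3), the estimate of T(k), and the resulting efficiency; the function names and the parameter values used in the example are hypothetical and are chosen only to show the trends.

```python
def message_time(m, alpha_d, beta_d):
    """Cost model (6.1.3) for a send or recv of m floating point numbers."""
    return alpha_d + beta_d * m

def ring_time(n, k, R, alpha_d, beta_d):
    """Estimated run time of the ring gaxpy on k processors (no idle waiting)."""
    if k == 1:
        return 2.0 * n * n / R                   # no communication needed
    r = n / k
    return k * (2.0 * r * r / R + 2.0 * message_time(r, alpha_d, beta_d))

def efficiency(n, p, R, alpha_d, beta_d):
    """E = T(1) / (p T(p))."""
    return ring_time(n, 1, R, alpha_d, beta_d) / (p * ring_time(n, p, R, alpha_d, beta_d))

# Hypothetical machine: 1e8 flops/sec, 1e-4 sec startup, 1e-7 sec per number.
R, alpha_d, beta_d = 1.0e8, 1.0e-4, 1.0e-7
e_small_n = efficiency(1000, 8, R, alpha_d, beta_d)
e_large_n = efficiency(8000, 8, R, alpha_d, beta_d)
e_more_procs = efficiency(8000, 64, R, alpha_d, beta_d)
```

With these assumed parameters the efficiency rises when n grows with p fixed and falls when p grows with n fixed, as the closed-form expression predicts. As the text cautions, such estimates are no substitute for benchmarking.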

A concept related to efficiency is speed-up. We say that a parallel algorithm
for a particular problem achieves speed-up S if

$$S = T_{seq}/T_{par}$$

where $T_{par}$ is the time required for execution of the parallel program and
$T_{seq}$ is the time required by one processor when the best uniprocessor procedure
is used. For some problems, the fastest sequential algorithm does
not parallelize and so two distinct algorithms are involved in the speed-up
assessment.

¹ We mention that these simple measures are not particularly illuminating in systems
where the nodes are able to overlap computation and communication.


6.1.7 The Challenge of Load Balancing


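If we apply Algorithm 6.1.1 to a matrix $A \in \mathbb{R}^{n \times n}$ that is lower triangular,
then approximately half of the flops associated with the $y_{loc}$ updates are
unnecessary because half of the $A_{ij}$ in (6.1.2) are zero. In particular, in the
$\mu$th processor, $A_{loc}(:, 1+(\tau-1)r:\tau r)$ is zero if $\tau > \mu$. Thus, if we guard
the $y_{loc}$ update as follows,

    if $\tau \le \mu$
        $y_{loc} = y_{loc} + A_{loc}(:, 1+(\tau-1)r:\tau r)\, x_{loc}$
    end

then the overall number of flops is halved. This solves the superfluous flops
problem but it creates a load imbalance problem. Proc($\mu$) oversees about
$\mu r^2/2$ flops, an increasing function of the processor id $\mu$. Consider the
following r = p = 3 example:

$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \\ z_8 \\ z_9 \end{bmatrix} = \left[ \begin{array}{ccc|ccc|ccc} \alpha & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \alpha & \alpha & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \alpha & \alpha & \alpha & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline \beta & \beta & \beta & \beta & 0 & 0 & 0 & 0 & 0 \\ \beta & \beta & \beta & \beta & \beta & 0 & 0 & 0 & 0 \\ \beta & \beta & \beta & \beta & \beta & \beta & 0 & 0 & 0 \\ \hline \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & 0 & 0 \\ \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & 0 \\ \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma \end{array} \right] \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \end{bmatrix}$$

Here, Proc(1) handles the $\alpha$ part, Proc(2) handles the $\beta$ part, and Proc(3)
handles the $\gamma$ part.

However, if processors 1, 2, and 3 compute $(z_1, z_4, z_7)$, $(z_2, z_5, z_8)$, and
$(z_3, z_6, z_9)$, respectively, then approximate load balancing results:

$$\begin{bmatrix} z_1 \\ z_4 \\ z_7 \\ z_2 \\ z_5 \\ z_8 \\ z_3 \\ z_6 \\ z_9 \end{bmatrix} = \left[ \begin{array}{ccc|ccc|ccc} \alpha & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \beta & \beta & \beta & \beta & 0 & 0 & 0 & 0 & 0 \\ \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & 0 & 0 \\ \hline \alpha & \alpha & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \beta & \beta & \beta & \beta & \beta & 0 & 0 & 0 & 0 \\ \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & 0 \\ \hline \alpha & \alpha & \alpha & 0 & 0 & 0 & 0 & 0 & 0 \\ \beta & \beta & \beta & \beta & \beta & \beta & 0 & 0 & 0 \\ \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma & \gamma \end{array} \right] \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix} + \begin{bmatrix} y_1 \\ y_4 \\ y_7 \\ y_2 \\ y_5 \\ y_8 \\ y_3 \\ y_6 \\ y_9 \end{bmatrix}$$

The amount of arithmetic still increases with $\mu$, but the effect is not noticeable
if $n \gg p$.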

The development of the general algorithm requires some index manipulation.
Assume that Proc($\mu$) is initialized with $A_{loc} = A(\mu:p:n, :)$ and
$y_{loc} = y(\mu:p:n)$, and assume that the contiguous x-subvectors circulate as
before. If at some stage $x_{loc}$ contains $x(1+(\tau-1)r:\tau r)$, then the update

$$y_{loc} = y_{loc} + A_{loc}(:, 1+(\tau-1)r:\tau r)\, x_{loc}$$

implements

$$y(\mu:p:n) = y(\mu:p:n) + A(\mu:p:n,\ 1+(\tau-1)r:\tau r)\, x(1+(\tau-1)r:\tau r).$$

To exploit the triangular structure of A in the $y_{loc}$ computation, we express
the gaxpy as a double loop:

    for $\alpha = 1:r$
        for $\beta = 1:r$
            $y_{loc}(\alpha) = y_{loc}(\alpha) + A_{loc}(\alpha, \beta+(\tau-1)r)\, x_{loc}(\beta)$
        end
    end

The $A_{loc}$ reference refers to $A(\mu+(\alpha-1)p,\ \beta+(\tau-1)r)$ which is zero unless
the column index is less than or equal to the row index. Abbreviating the
inner loop range with this in mind we obtain


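Algorithm 6.1.2 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^{n}$, and $y \in \mathbb{R}^{n}$ are given and
that z = y + Ax. Assume that n = rp and that A is lower triangular. If
each processor in a p-processor ring executes the following node program,
then upon completion Proc($\mu$) houses $z(\mu:p:n)$ in $y_{loc}$. Assume the following
local memory initializations: p, $\mu$ (the node id), left and right (the neighbor
id's), n, $A_{loc} = A(\mu:p:n, :)$, $y_{loc} = y(\mu:p:n)$, and $x_{loc} = x(1+(\mu-1)r:\mu r)$.

    $r = n/p$
    for $t = 1:p$
        send($x_{loc}$, right)
        recv($x_{loc}$, left)
        $\tau = \mu - t$
        if $\tau \le 0$
            $\tau = \tau + p$
        end
        { $x_{loc} = x(1+(\tau-1)r:\tau r)$ }
        for $\alpha = 1:r$
            for $\beta = 1:\min(r,\ \mu+(\alpha-1)p-(\tau-1)r)$
                $y_{loc}(\alpha) = y_{loc}(\alpha) + A_{loc}(\alpha, \beta+(\tau-1)r)\, x_{loc}(\beta)$
            end
        end
    end

The upper limit of the inner loop is capped at r because $x_{loc}$ has only r
components; the loop is empty when the limit is less than one.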
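The index bookkeeping in Algorithm 6.1.2 is easy to get wrong, so a sequential simulation is a useful check. The sketch below steps all p node programs in lockstep instead of running them concurrently, and it counts the flops each node performs. The function name `ring_gaxpy_lower` and the flop counter are illustrative additions, and the inner loop limit is capped at r so that the local x subvector is never indexed out of range.

```python
def ring_gaxpy_lower(A, x, y, p):
    """Lockstep simulation of the wrap-mapped ring gaxpy for lower triangular A."""
    n = len(x)
    r = n // p                                              # assumes n = r*p
    A_loc = [[A[i] for i in range(mu - 1, n, p)] for mu in range(1, p + 1)]  # A(mu:p:n, :)
    y_loc = [[y[i] for i in range(mu - 1, n, p)] for mu in range(1, p + 1)]  # y(mu:p:n)
    x_loc = [x[(mu - 1) * r: mu * r] for mu in range(1, p + 1)]
    flops = [0] * p
    for t in range(1, p + 1):
        # Every node sends right and receives from its left neighbor.
        x_loc = [x_loc[(k - 1) % p] for k in range(p)]
        for mu in range(1, p + 1):
            tau = mu - t
            if tau <= 0:
                tau += p
            for a in range(1, r + 1):
                last = min(r, mu + (a - 1) * p - (tau - 1) * r)
                for b in range(1, last + 1):
                    col = b + (tau - 1) * r                 # global column index
                    y_loc[mu - 1][a - 1] += A_loc[mu - 1][a - 1][col - 1] * x_loc[mu - 1][b - 1]
                    flops[mu - 1] += 2
    z = [0] * n
    for mu in range(1, p + 1):
        for a, i in enumerate(range(mu - 1, n, p)):
            z[i] = y_loc[mu - 1][a]                         # unwrap z(mu:p:n)
    return z, flops
```

For n = 9 and p = 3 the per-node flop counts come out as 24, 30, and 36, which illustrates the remark above that the work still grows with $\mu$ under the wrap mapping, though only mildly.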


Having to map indices back and forth between “node space” and “global 
space” is one aspect of distributed matrix computations that requires care 
and (hopefully) compiler assistance. 
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6.1.8 Tradeoffs 


As we did in §1.1, let us develop a column-oriented gaxpy and anticipate
its performance. With the block column partitioning
$$A = [\,A_1, \ldots, A_p\,], \qquad A_\mu \in \mathbb{R}^{n \times r}, \qquad r = n/p,$$
the gaxpy $z = y + Ax$ becomes
$$z = y + \sum_{\mu=1}^{p} A_\mu x_\mu$$
where $x_\mu = x(1 + (\mu - 1)r:\mu r)$. Assume that Proc($\mu$) contains $A_\mu$
and $x_\mu$. Its contribution to the gaxpy is the product $A_\mu x_\mu$ and
involves local data. However, these products must be summed. We assign this
task to Proc(1), which we assume contains $y$. The strategy is thus for each
processor to compute $A_\mu x_\mu$ and to send the result to Proc(1).


Algorithm 6.1.3 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$ and
$y \in \mathbb{R}^n$ are given and that $z = y + Ax$. If each processor in a
$p$-processor network executes the following node program and $n = rp$, then
upon completion Proc(1) houses $z$. Assume the following local memory
initializations: $p$, $\mu$ (the node id), $n$,
$x_{loc} = x(1 + (\mu - 1)r:\mu r)$, $A_{loc} = A(:, 1 + (\mu - 1)r:\mu r)$, and
(in Proc(1) only) $y_{loc} = y$.

    if μ = 1
        y_loc = y_loc + A_loc x_loc
        for t = 2:p
            recv(w_loc, t)
            y_loc = y_loc + w_loc
        end
    else
        w_loc = A_loc x_loc
        send(w_loc, 1)
    end


At first glance this seems to be much less attractive than the row-oriented
Algorithm 6.1.1. The additional responsibilities of Proc(1) mean that it
has more arithmetic to perform by a factor of about
$$\frac{2n^2/p + np}{2n^2/p} = 1 + \frac{p^2}{2n}$$
and more messages to process by a factor of about $p$. This imbalance becomes
less critical if $n \gg p$ and the communication parameters $\alpha_d$ and
$\beta_d$ are small enough. Another possible mitigating factor is that
Algorithm 6.1.3 manipulates length $n$ vectors whereas Algorithm 6.1.1 works
with length $n/p$ vectors. If the nodes are capable of vector arithmetic, then
the longer vectors may raise the level of performance.

This brief comparison of Algorithms 6.1.1 and 6.1.3 reminds us once 
again that different implementations of the same computation can have 
very different performance characteristics. 
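The arithmetic imbalance of the column-oriented scheme can be tabulated with a few lines of Python. This is a rough sketch under stated assumptions: the function names are hypothetical, the product of an n-by-r block with an r-vector is charged 2nr flops on every processor, and Proc(1) is charged an additional n flops for each of the p - 1 vectors it receives and adds.

```python
def column_gaxpy_flops(n, p):
    """Return (flops on Proc(1), flops on any other processor) for the column gaxpy."""
    r = n // p
    local = 2 * n * r                # A_loc x_loc, an n-by-r block times an r-vector
    proc1 = local + n * (p - 1)      # Proc(1) also adds the p-1 received n-vectors
    return proc1, local


def imbalance(n, p):
    """Ratio of Proc(1)'s arithmetic to that of the other processors."""
    proc1, other = column_gaxpy_flops(n, p)
    return proc1 / other


if __name__ == "__main__":
    for n, p in [(64, 8), (1024, 8), (16384, 8)]:
        print(n, p, imbalance(n, p), 1 + p * p / (2 * n))
```

The exact ratio under this count is 1 + p(p - 1)/(2n), slightly below the estimate 1 + p^2/(2n), and both tend to 1 as n grows relative to p.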


6.1.9 Shared Memory Systems 


We now discuss the gaxpy problem for a shared memory multiprocessor. In
this environment each processor has access to a common, global memory
as depicted in Figure 6.1.2. Communication between processors is achieved
by reading and writing to global variables that reside in the global memory.
Each processor executes its own local program and has its own local memory.
Data flows to and from the global memory during execution.

FIGURE 6.1.2 A Four-Processor Shared Memory System


All the concerns that attend distributed memory computation are with
us in modified form. The overall procedure should be load balanced and the
computations should be arranged so that the individual processors have
to wait as little as possible for something useful to compute. The traffic
between the global and local memories must be managed carefully, because
the extent of such data transfers is typically a significant overhead. (It
corresponds to interprocessor communication in the distributed memory
setting and to data motion up and down a memory hierarchy as discussed
in §1.4.5.) The nature of the physical connection between the processors
and the shared memory is very important and can affect algorithmic
development. However, for simplicity we regard this aspect of the system as a
black box as shown in Figure 6.1.2.

6.1.10 A Shared Memory Gaxpy

Consider the following partitioning of the n-by-n gaxpy problem $z = y + Ax$:
$$\begin{bmatrix} z_1 \\ \vdots \\ z_p \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} + \begin{bmatrix} A_1 \\ \vdots \\ A_p \end{bmatrix} x. \qquad (6.1.4)$$

Here we assume that $n = rp$ and that $A_\mu \in \mathbb{R}^{r \times n}$,
$y_\mu \in \mathbb{R}^r$, and $z_\mu \in \mathbb{R}^r$.

We use the following algorithm to introduce the basic ideas and notations.


Algorithm 6.1.4 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, and
$y \in \mathbb{R}^n$ reside in a global memory accessible to $p$ processors. If
$n = rp$ and each processor executes the following algorithm, then upon
completion, $y$ is overwritten by $z = y + Ax$. Assume the following
initializations in each local memory: $p$, $\mu$ (the node id), and $n$.

    r = n/p
    row = 1 + (μ - 1)r:μr
    x_loc = x
    y_loc = y(row)
    for j = 1:n
        a_loc = A(row, j)
        y_loc = y_loc + a_loc x_loc(j)
    end
    y(row) = y_loc


We assume that a copy of this program resides in each processor. Floating
point variables that are local to an individual processor have a "loc"
subscript.

Data is transferred to and from the global memory during the execution
of Algorithm 6.1.4. There are two global memory reads before the loop
($x_{loc} = x$ and $y_{loc} = y(row)$), one read each time through the loop
($a_{loc} = A(row, j)$), and one write after the loop ($y(row) = y_{loc}$).

Only one processor writes to a given global memory location in y, and 
so there is no need to synchronize the participating processors. Each has 
a completely independent part of the overall gaxpy operation and does not 
have to monitor the progress of the other processors. The computation is 
statically scheduled because the partitioning of work is determined before 
execution. 
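A statically scheduled row-block gaxpy of this kind can be mimicked with Python threads. The sketch below is an assumption-laden illustration rather than a performance recipe: `shared_memory_gaxpy` is a hypothetical name, ordinary Python lists stand in for the global memory, and each thread plays the role of one processor. In standard CPython builds the threads do not execute the arithmetic in parallel, so only the structure of the computation is shown.

```python
import threading


def shared_memory_gaxpy(A, x, y, p):
    """Overwrite y with y + A x using p threads, one contiguous block row each."""
    n = len(x)
    assert n % p == 0, "the sketch assumes n = r*p"
    r = n // p

    def node(mu):                           # mu = 1..p is the node id
        rows = range((mu - 1) * r, mu * r)  # row = 1+(mu-1)r : mu*r, 0-based here
        x_loc = list(x)                     # read x from "global memory"
        y_loc = [y[i] for i in rows]        # read y(row)
        for j in range(n):
            a_loc = [A[i][j] for i in rows]  # read one column of A(row, :)
            for k in range(r):
                y_loc[k] += a_loc[k] * x_loc[j]
        for k, i in enumerate(rows):        # write y(row); no other thread writes here
            y[i] = y_loc[k]

    threads = [threading.Thread(target=node, args=(mu,)) for mu in range(1, p + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y
```

Because each thread writes to its own slice of y, no lock or barrier appears in the node function; the only coordination is the final join in the driver.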

If $A$ is lower triangular, then steps have to be taken to preserve the
load balancing in Algorithm 6.1.4. As we discovered in §6.1.7, the wrap
mapping is a vehicle for doing this. Assigning Proc($\mu$) the computation of
$z(\mu:p:n) = y(\mu:p:n) + A(\mu:p:n, :)x$ effectively partitions the $n^2$ flops
among the $p$ processors.

6.1.11 Memory Traffic Overhead

It is important to recognize that overall performance depends strongly on
the overheads associated with the reads and writes to the global memory.
If such a data transfer involves $m$ floating point numbers, then we model
the transfer time by
$$T(m) = \alpha_s + \beta_s m. \qquad (6.1.5)$$

The parameter $\alpha_s$ represents a start-up overhead and $\beta_s$ is the
reciprocal transfer rate. We modelled interprocessor communication in the
distributed environment exactly the same way. (See (6.1.3).)

Accounting for all the shared memory reads and writes in Algorithm
6.1.4 we see that each processor spends time
$$T \approx (n + 3)\alpha_s + \frac{n^2}{p}\beta_s$$
communicating with global memory.
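The estimate can be checked by tallying the individual transfers of Algorithm 6.1.4 under the model (6.1.5). The following Python sketch uses hypothetical function names and made-up parameter values in its driver; it simply counts one read of x (n numbers), one read of y(row) (r numbers), n reads of a column of A(row, :) (r numbers each), and one write of y(row).

```python
def transfer_time(m, alpha_s, beta_s):
    """Model (6.1.5): time to move m floating point numbers."""
    return alpha_s + beta_s * m


def traffic_exact(n, p, alpha_s, beta_s):
    """Sum the modelled cost of every global memory access made by one processor."""
    r = n // p
    total = transfer_time(n, alpha_s, beta_s)       # x_loc = x
    total += transfer_time(r, alpha_s, beta_s)      # y_loc = y(row)
    for _ in range(n):                              # a_loc = A(row, j) for j = 1:n
        total += transfer_time(r, alpha_s, beta_s)
    total += transfer_time(r, alpha_s, beta_s)      # y(row) = y_loc
    return total


def traffic_estimate(n, p, alpha_s, beta_s):
    """Leading-order estimate (n + 3) alpha_s + (n^2 / p) beta_s."""
    return (n + 3) * alpha_s + (n * n / p) * beta_s


if __name__ == "__main__":
    # Illustrative parameter values only; real systems must be benchmarked.
    print(traffic_exact(1000, 10, 1e-4, 1e-7), traffic_estimate(1000, 10, 1e-4, 1e-7))
```

The two values agree in the start-up term exactly and differ only by the lower-order contribution (n + 2r) times the reciprocal transfer rate.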

We organized the computation so that one column of $A(row, :)$ is read
at a time from shared memory. If the local memory is large enough, then
the loop in Algorithm 6.1.4 can be replaced with

    A_loc = A(row, :)
    y_loc = y_loc + A_loc x_loc

This changes the communication overhead to
$$T \approx 3\alpha_s + \frac{n^2}{p}\beta_s,$$
a significant improvement if the start-up parameter $\alpha_s$ is large.


6.1.12 Barrier Synchronization 


Let us consider the shared memory version of Algorithm 6.1.4 in which
the gaxpy is column oriented. Assume $n = rp$ and $col = 1 + (\mu - 1)r:\mu r$.
A reasonable idea is to use a global array $W(1:n, 1:p)$ to house the
products $A(:, col)x(col)$ produced by each processor, and then have some chosen
processor (say Proc(1)) add its columns:

    A_loc = A(:, col); x_loc = x(col); w_loc = A_loc x_loc; W(:, μ) = w_loc
    if μ = 1
        y_loc = y
        for j = 1:p
            w_loc = W(:, j)
            y_loc = y_loc + w_loc
        end
        y = y_loc
    end

However, this strategy is seriously flawed because there is no guarantee that
$W(1:n, 1:p)$ is fully initialized when Proc(1) begins the summation process.

What we need is a synchronization construct that can delay the Proc(1)
summation until all the processors have computed and stored their
contributions in the $W$ array. For this purpose many shared memory systems
support some version of the barrier construct which we introduce in the
following algorithm:


Algorithm 6.1.5 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, and
$y \in \mathbb{R}^n$ reside in a global memory accessible to $p$ processors. If
$n = rp$ and each processor executes the following algorithm, then upon
completion $y$ is overwritten by $y + Ax$. Assume the following
initializations in each local memory: $p$, $\mu$ (the node id), and $n$.

    r = n/p; col = 1 + (μ - 1)r:μr; A_loc = A(:, col); x_loc = x(col)
    w_loc = A_loc x_loc
    W(:, μ) = w_loc
    barrier
    if μ = 1
        y_loc = y
        for j = 1:p
            w_loc = W(:, j)
            y_loc = y_loc + w_loc
        end
        y = y_loc
    end


To understand the barrier, it is convenient to regard a processor as either 
blocked or free. A processor is blocked and suspends execution when it 
executes the barrier. After the pth processor is blocked, all the processors 
return to the “free state” and resume execution. Think of the barrier as a
treacherous stream to be traversed by all p processors. For safety, they
all congregate on the bank before attempting to cross. When the last 
member of the party arrives, they ford the stream in unison and resume 
their individual treks. 

In Algorithm 6.1.5, the processors are blocked after computing their 
portion of the matrix-vector product. We cannot predict the order in which 
these blockings occur, but once the last processor reaches the barrier, they 
are all released and Proc(1) can carry out the vector summation. 
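The barrier idea maps directly onto the `threading.Barrier` class in the Python standard library. The sketch below is a hedged illustration, not a definitive implementation: the function name `barrier_gaxpy` is hypothetical, lists stand in for the global arrays y and W, and each thread takes the part of one processor in a column-oriented gaxpy.

```python
import threading


def barrier_gaxpy(A, x, y, p):
    """Column-oriented y <- y + A x with p threads and a single barrier (n = r*p)."""
    n = len(x)
    assert n % p == 0, "the sketch assumes n = r*p"
    r = n // p
    W = [[0.0] * p for _ in range(n)]       # global workspace W(1:n, 1:p)
    barrier = threading.Barrier(p)          # released once all p threads arrive

    def node(mu):                           # mu = 1..p is the node id
        cols = range((mu - 1) * r, mu * r)
        w_loc = [sum(A[i][j] * x[j] for j in cols) for i in range(n)]
        for i in range(n):
            W[i][mu - 1] = w_loc[i]         # W(:, mu) = w_loc
        barrier.wait()                      # block until every column of W is stored
        if mu == 1:
            y_loc = list(y)
            for j in range(p):
                for i in range(n):
                    y_loc[i] += W[i][j]
            y[:] = y_loc                    # write the sum back to global y

    threads = [threading.Thread(target=node, args=(mu,)) for mu in range(1, p + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y
```

Removing the `barrier.wait()` call reproduces the flaw discussed above: the summing thread could read columns of W that have not yet been written.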


6.1.13 Dynamic Scheduling

Instead of having one processor in charge of the vector summation, it is
tempting to have each processor add its contribution directly to the global
variable $y$. For Proc($\mu$), this means executing the following:

    r = n/p; col = 1 + (μ - 1)r:μr; A_loc = A(:, col); x_loc = x(col)
    w_loc = A_loc x_loc
    y_loc = y; y_loc = y_loc + w_loc; y = y_loc

However, a problem concerns the read-update-write triplet

    y_loc = y; y_loc = y_loc + w_loc; y = y_loc


Indeed, if more than one processor is executing this code fragment at the 
same time, then there may be a loss of information. Consider the following 
sequence: 


Proc(1) reads y 
Proc(2) reads y 
Proc(1) writes y 
Proc(2) writes y 


The contribution of Proc(1) is lost because Proc(1) and Proc(2) obtain the 
same version of y. As a result, the effect of the Proc(1) write is erased by 
the Proc(2) write. 

To prevent this kind of thing from happening most shared memory 
systems support the idea of a critical section. These are special, isolated 
portions of a node program that require a “key” to enter. Throughout the 
system, there is only one key and so the net effect is that only one processor 
can be executing in a critical section at any given time. 


Algorithm 6.1.6 Suppose $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, and
$y \in \mathbb{R}^n$ reside in a global memory accessible to $p$ processors. If
$n = pr$ and each processor executes the following algorithm, then upon
completion, $y$ is overwritten by $y + Ax$. Assume the following
initializations in each local memory: $p$, $\mu$ (the node id), and $n$.

    r = n/p; col = 1 + (μ - 1)r:μr; A_loc = A(:, col); x_loc = x(col)
    w_loc = A_loc x_loc
    begin critical section
        y_loc = y
        y_loc = y_loc + w_loc
        y = y_loc
    end critical section
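In Python, a critical section of this kind can be sketched with a `threading.Lock`, which plays the part of the single "key". The code below is an illustration under stated assumptions: `critical_section_gaxpy` is a hypothetical name, a list stands in for the global vector y, and the order in which threads acquire the lock is left to the runtime.

```python
import threading


def critical_section_gaxpy(A, x, y, p):
    """Column-oriented y <- y + A x; each thread adds its product inside a lock."""
    n = len(x)
    assert n % p == 0, "the sketch assumes n = r*p"
    r = n // p
    key = threading.Lock()                  # only one thread may hold the key

    def node(mu):                           # mu = 1..p is the node id
        cols = range((mu - 1) * r, mu * r)
        w_loc = [sum(A[i][j] * x[j] for j in cols) for i in range(n)]
        with key:                           # begin critical section
            y_loc = list(y)                 # read
            y_loc = [a + b for a, b in zip(y_loc, w_loc)]   # update
            y[:] = y_loc                    # write
        # end critical section

    threads = [threading.Thread(target=node, args=(mu,)) for mu in range(1, p + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y
```

The result is the same for every acquisition order up to floating point rounding, since only the order of the p vector additions changes from run to run.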


This use of the critical section concept controls the update of $y$ in a way
that ensures correctness. The algorithm is dynamically scheduled because
the order in which the summations occur is determined as the computation
unfolds. Dynamic scheduling is very important in problems with irregular
structure.

Problems

P6.1.1 Modify Algorithm 6.1.1 so that it can handle arbitrary $n$.

P6.1.2 Modify Algorithm 6.1.2 so that it efficiently handles the upper triangular case.

P6.1.3 (a) Modify Algorithms 6.1.3 and 6.1.4 so that they overwrite $y$ with $z = y + A^m x$
for a given positive integer $m$ that is available to each processor. (b) Modify Algorithms
6.1.3 and 6.1.4 so that $y$ is overwritten by $z = y + A^T A x$.

P6.1.4 Modify Algorithm 6.1.3 so that upon completion, the local array $A_{loc}$ in Proc($\mu$)
houses the $\mu$th block column of $A + xy^T$.

P6.1.5 Modify Algorithm 6.1.4 so that (a) $A$ is overwritten by the outer product update
$A + xy^T$, (b) $x$ is overwritten with $A^2 x$, (c) $y$ is overwritten by a unit 2-norm vector in
the direction of $y + A^k x$, and (d) it efficiently handles the case when $A$ is lower triangular.

netica 6, 28-40.

J.J. Dongarra and D.C. Sorensen (1986). “Linear Algebra on High Performance
Computers,” Appl. Math. and Comp. 20, 57-88.

K.A. Gallivan, R.J. Plemmons, and A.H. Sameh (1990). “Parallel Algorithms for Dense
Linear Algebra Computations,” SIAM Review 32, 54-135.

Algebra,” in Acta Numerica 1993, Cambridge University Press.

L. Adams and T. Crockett (1984). “Modeling Algorithm Execution Time on Processor
Arrays,” Computer 17, 38-43.

D. Gannon and J. Van Rosendale (1984). “On the Impact of Communication Complexity
on the Design of Parallel Numerical Algorithms,” IEEE Trans. Comp. C-33, 1180-1194.

S.L. Johnsson (1987). “Communication Efficient Basic Linear Algebra Computations on 
Hypercube Multiprocessors,” J. Parallel and Distributed Computing, No. 4, 133-172. 

Y. Saad and M. Schultz (1989). "Data Communication in Hypercubes,” J. Dist. Parallel 
Comp. 6, 115-135. 

Y. Saad and M.H. Schultz (1989). “Data Communication in Parallel Architectures,” J. 
Dist. Parallel Comp. 11, 131-150. 


For snapshots of basic linear algebra computation on a distributed memory system, see 


O. McBryan and E.F. van de Velde (1987). “Hypercube Algorithms and Implementa- 
tions," SIAM J. Sci. and Stat. Comp. 8, s227-s287. 

S.L. Johnsson and C.T. Ho (1988). “Matrix Transposition on Boolean n-cube Configured
Ensemble Architectures,” SIAM J. Matrix Anal. Appl. 9, 419-454.

T. Dehn, M. Eiermann, K. Giebermann, and V. Sperling (1995). “Structured Sparse
Matrix Vector Multiplication on Massively Parallel SIMD Architectures,” Parallel
Computing 21, 1887-1894.

J. Choi, J.J. Dongarra, and D.W. Walker (1995). “Parallel Matrix Transpose Algorithms
on Distributed Memory Concurrent Computers,” Parallel Computing 21, 1387-1406.

L. Colombet, Ph. Michallon, and D. Trystram (1996). “Parallel Matrix-Vector Product
on Rings with a Minimum of Communication,” Parallel Computing 22, 289-310.


The implementation of a parallel algorithm is usually very challenging. It is important 
to have compilers and related tools that are able to handle the details. See 


D.P. O'Leary and G.W. Stewart (1986). “Assignment and Scheduling in Parallel Matrix
Factorization,” Lin. Alg. and Its Applic. 77, 275-300.

J. Dongarra and D.C. Sorensen (1987). “A Portable Environment for Developing Parallel
Programs,” Parallel Computing 5, 175-186.

K. Connolly, J.J. Dongarra, D. Sorensen, and J. Patterson (1988). “Programming
Methodology and Performance Issues for Advanced Computer Architectures,”
Parallel Computing 5, 41-58.

P. Jacobson, B. Kagstrom, and M. Rannar (1992). “Algorithm Development for
Distributed Memory Multicomputers Using Conlab,” Scientific Programming 1, 185-203.

C. Ancourt, F. Coelho, F. Irigoin, and R. Keryell (1993). “A Linear Algebra Framework
for Static HPF Code Distribution,” Proceedings of the 4th Workshop on Compilers
for Parallel Computers, Delft, The Netherlands.

D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill (1993). “Solving Alignment
Using Elementary Linear Algebra,” in Proceedings of the 7th International Workshop
on Languages and Compilers for Parallel Computing, Lecture Notes in Computer
Science 892, Springer-Verlag, New York, 46-60.

M. Wolfe (1996). High Performance Compilers for Parallel Computers, Addison-Wesley,
Reading MA.




6.2 Matrix Multiplication 


In this section we develop two parallel algorithms for matrix-matrix multi- 
plication. A shared memory implementation is used to illustrate the effect 
of blocking on granularity and load balancing. A torus implementation is 
designed to convey the spirit of two-dimensional data flow. 


6.2.1 A Block Gaxpy Procedure 


Suppose $A, B, C \in \mathbb{R}^{n \times n}$ with $B$ upper triangular and consider
the computation of the matrix multiply update
$$D = C + AB \qquad (6.2.1)$$
on a shared memory computer with $p$ processors. Assume that $n = rkp$
and partition the update
$$[\,D_1, \ldots, D_{kp}\,] = [\,C_1, \ldots, C_{kp}\,] + [\,A_1, \ldots, A_{kp}\,][\,B_1, \ldots, B_{kp}\,] \qquad (6.2.2)$$
where each block column has width $r = n/(kp)$. If
$$B_j = \begin{bmatrix} B_{1j} \\ \vdots \\ B_{jj} \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad B_{ij} \in \mathbb{R}^{r \times r},$$
then
$$D_j = C_j + AB_j = C_j + \sum_{\tau=1}^{j} A_\tau B_{\tau j}. \qquad (6.2.3)$$

The number of flops required to compute $D_j$ is given by
$$f_j = 2nr^2 j = \frac{2n^3}{k^2 p^2}\, j.$$

This is an increasing function of $j$ because $B$ is upper triangular. As we
discovered in the previous section, the wrap mapping is the way to solve
load imbalance problems that result from triangular matrix structure. This
suggests that we assign Proc($\mu$) the task of computing $D_j$ for $j = \mu:p:kp$.


Algorithm 6.2.1 Suppose $A$, $B$, and $C$ are n-by-n matrices that reside
in a global memory accessible to $p$ processors. Assume that $B$ is upper
triangular and $n = rkp$. If each processor executes the following algorithm,
then upon completion $C$ is overwritten by $D = C + AB$. Assume the
following initializations in each local memory: $n$, $r$, $k$, $p$ and $\mu$ (the node
id).

    for j = μ:p:kp
        {Compute D_j.}
        B_loc = B(1:jr, 1 + (j - 1)r:jr)
        C_loc = C(:, 1 + (j - 1)r:jr)
        for τ = 1:j
            col = 1 + (τ - 1)r:τr
            A_loc = A(:, col)
            C_loc = C_loc + A_loc B_loc(col, :)
        end
        C(:, 1 + (j - 1)r:jr) = C_loc
    end

Let us examine the degree of load balancing as a function of the parameter
$k$. For Proc($\mu$), the number of flops required is given by
$$F(\mu) = \sum_{j=\mu:p:kp} f_j \approx \left( \frac{k^2 p}{2} + k\mu \right) \frac{2n^3}{k^2 p^2}.$$

The quotient $F(p)/F(1)$ is a measure of load balancing from the flop point
of view. Since
$$\frac{F(p)}{F(1)} = \frac{kp + k^2 p/2}{k + k^2 p/2} = 1 + \frac{2(p - 1)}{2 + kp}$$
we see that arithmetic balance improves with increasing $k$. A similar
analysis shows that the communication overheads are well balanced as $k$
increases.
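The effect of k on load balance is easy to tabulate. The following Python sketch is a hedged illustration with hypothetical function names: it assumes that each r-by-r block product with an n-by-r block column costs 2nr^2 flops, so that forming D_j costs 2nr^2 j flops, and it sums this cost over the wrap-mapped indices j = mu:p:kp.

```python
def flops_per_processor(n, k, p, mu):
    """Flops assigned to Proc(mu) when it computes D_j for j = mu:p:kp."""
    r = n // (k * p)
    return sum(2 * n * r * r * j for j in range(mu, k * p + 1, p))


def balance_ratio(n, k, p):
    """F(p)/F(1): work of the most loaded processor over the least loaded one."""
    return flops_per_processor(n, k, p, p) / flops_per_processor(n, k, p, 1)


if __name__ == "__main__":
    n, p = 64, 4
    for k in (1, 2, 4, 8, 16):
        print(k, balance_ratio(n, k, p), 1 + 2 * (p - 1) / (2 + k * p))
```

With k = 1 the last processor does p times the work of the first, while larger k drives the ratio toward 1, in line with the estimate 1 + 2(p - 1)/(2 + kp).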

On the other hand, the total number of global memory reads and writes 
associated with Algorithm 6.2.1 increases with the square of k. If the start- 
up parameter a, in (6.1.5) is large, then performance can degrade with 
increased k. 

The optimum choice for $k$ given these two opposing forces is system
dependent. If communication is fast, then smaller tasks can be supported
without penalty and this makes it easier to achieve load balancing. A
multiprocessor with this attribute supports fine-grained parallelism. However,
if granularity is too fine in a system with high-performance nodes, then it
may be impossible for the node programs to perform at level-2 or level-3
speeds simply because there just is not enough local linear algebra. Again,
benchmarking is the only way to clarify these issues.


6.2.2 Torus 


A torus is a two-dimensional processor array in which each row and
column is a ring. See Figure 6.2.1. A processor id in this context is an
ordered pair and each processor has four neighbors. In the displayed
example, Proc(1,3) has west neighbor Proc(1,2), east neighbor Proc(1,4), south
neighbor Proc(2,3), and north neighbor Proc(4,3).

FIGURE 6.2.1 A Four-by-Four Torus

To show what it is like to organize a toroidal matrix computation, we
develop an algorithm for the matrix multiplication $D = C + AB$ where
$A, B, C \in \mathbb{R}^{n \times n}$. Assume that the torus is $p_1$-by-$p_1$ and that
$n = rp_1$. Regard $A = (A_{ij})$, $B = (B_{ij})$, and $C = (C_{ij})$ as $p_1$-by-$p_1$
block matrices with r-by-r blocks. Assume that Proc(i, j) contains $A_{ij}$,
$B_{ij}$, and $C_{ij}$ and that its mission is to overwrite $C_{ij}$ with
$$D_{ij} = C_{ij} + \sum_{k=1}^{p_1} A_{ik} B_{kj}.$$

We develop the general algorithm from the $p_1 = 3$ case, displaying the torus
in cellular form as follows:

    Proc(1,1) | Proc(1,2) | Proc(1,3)
    Proc(2,1) | Proc(2,2) | Proc(2,3)
    Proc(3,1) | Proc(3,2) | Proc(3,3)

Let us focus attention on Proc(1,1) and the calculation of
$$D_{11} = C_{11} + A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31}.$$

Suppose the six inputs that define this block dot product are positioned
within the torus as follows:

    A11 B11 | A12  .  | A13  .
     .  B21 |  .   .  |  .   .
     .  B31 |  .   .  |  .   .

(Pay no attention to the "dots." They are later replaced by various $A_{ij}$
and $B_{ij}$.)

Our plan is to “ratchet” the first block row of $A$ and the first block
column of $B$ through Proc(1,1) in a coordinated fashion. The pairs $A_{11}$
and $B_{11}$, $A_{12}$ and $B_{21}$, and $A_{13}$ and $B_{31}$ meet, are multiplied, and added
into a running sum array $C_{loc}$:
$$C_{loc} = C_{loc} + A_{12}B_{21}$$
$$C_{loc} = C_{loc} + A_{13}B_{31}$$
$$C_{loc} = C_{loc} + A_{11}B_{11}$$

Thus, after three steps, the local array $C_{loc}$ in Proc(1,1) houses $D_{11}$.

We have organized the flow of data so that the $A_{1j}$ migrate westwards
and the $B_{i1}$ migrate northwards through the torus. It is thus apparent that
Proc(1,1) must execute a node program of the form:


    for t = 1:3
        send(A_loc, west)
        send(B_loc, north)
        recv(A_loc, east)
        recv(B_loc, south)
        C_loc = C_loc + A_loc B_loc
    end

The send-recv-send-recv sequence

    for t = 1:3
        send(A_loc, west)
        recv(A_loc, east)
        send(B_loc, north)
        recv(B_loc, south)
        C_loc = C_loc + A_loc B_loc
    end


also works. However, this induces unnecessary delays into the process be- 
cause the B submatrix is not sent until the new A submatrix arrives. 

We next consider the activity in Proc(1,2), Proc(1,3), Proc(2,1), and
Proc(3,1). At this point in the development, these processors merely help
circulate blocks $A_{11}$, $A_{12}$, and $A_{13}$ and $B_{11}$, $B_{21}$, and $B_{31}$, respectively. If
$B_{32}$, $B_{12}$, and $B_{22}$ flowed through Proc(1,2) during these steps, then
$$D_{12} = C_{12} + A_{13}B_{32} + A_{11}B_{12} + A_{12}B_{22}$$
could be formed. Likewise, Proc(1,3) could compute
$$D_{13} = C_{13} + A_{11}B_{13} + A_{12}B_{23} + A_{13}B_{33}$$
if $B_{13}$, $B_{23}$, and $B_{33}$ are available during $t = 1:3$. To this end we initialize
the torus as follows

    A11 B11 | A12 B22 | A13 B33
     .  B21 |  .  B32 |  .  B13
     .  B31 |  .  B12 |  .  B23

With northward flow of the $B_{ij}$ we get the following pairs in the first
processor row:

    t = 1:   A12 B21 | A13 B32 | A11 B13
    t = 2:   A13 B31 | A11 B12 | A12 B23
    t = 3:   A11 B11 | A12 B22 | A13 B33

Thus, if $B$ is mapped onto the torus in a “staggered start” fashion, we can
arrange for the first row of processors to compute the first row of $C$.

If we stagger the second and third rows of $A$ in a similar fashion, then
we can arrange for all nine processors to perform a multiply-add at each
step. In particular, if we set

    A11 B11 | A12 B22 | A13 B33
    A22 B21 | A23 B32 | A21 B13
    A33 B31 | A31 B12 | A32 B23

then with westward flow of the $A_{ij}$ and northward flow of the $B_{ij}$ we obtain

    t = 1:   A12 B21 | A13 B32 | A11 B13
             A23 B31 | A21 B12 | A22 B23
             A31 B11 | A32 B22 | A33 B33

    t = 2:   A13 B31 | A11 B12 | A12 B23
             A21 B11 | A22 B22 | A23 B33
             A32 B21 | A33 B32 | A31 B13

    t = 3:   A11 B11 | A12 B22 | A13 B33
             A22 B21 | A23 B32 | A21 B13
             A33 B31 | A31 B12 | A32 B23

From this example we are ready to specify the general algorithm. We
assume that at the start, Proc(i, j) houses $A_{ij}$, $B_{ij}$, and $C_{ij}$. To obtain the
necessary staggering of the $A$ data, we note that in processor row $i$ the $A_{ij}$
should be circulated westward $i - 1$ positions. Likewise, in the $j$th column
of processors, the $B_{ij}$ should be circulated northward $j - 1$ positions. This
gives the following algorithm:


Algorithm 6.2.2 Suppose $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times n}$, and
$C \in \mathbb{R}^{n \times n}$ are given and that $D = C + AB$. If each processor in a
$p_1$-by-$p_1$ torus executes the following algorithm and $n = p_1 r$, then upon
completion Proc($\mu$, $\lambda$) houses $D_{\mu\lambda}$ in local variable $C_{loc}$. Assume the
following local memory initializations: $p_1$, ($\mu$, $\lambda$) (the node id), north,
east, south, and west (the four neighbor id's), $row = 1 + (\mu - 1)r:\mu r$,
$col = 1 + (\lambda - 1)r:\lambda r$, $A_{loc} = A(row, col)$, $B_{loc} = B(row, col)$, and
$C_{loc} = C(row, col)$.

    {Stagger the A_μj and B_iλ.}
    for k = 1:μ - 1
        send(A_loc, west); recv(A_loc, east)
    end
    for k = 1:λ - 1
        send(B_loc, north); recv(B_loc, south)
    end
    for k = 1:p_1
        C_loc = C_loc + A_loc B_loc
        send(A_loc, west)
        send(B_loc, north)
        recv(A_loc, east)
        recv(B_loc, south)
    end
    {Unstagger the A_μj and B_iλ.}
    for k = 1:μ - 1
        send(A_loc, east); recv(A_loc, west)
    end
    for k = 1:λ - 1
        send(B_loc, south); recv(B_loc, north)
    end

Each pass through the main loop performs $2r^3$ flops while moving only two
r-by-r blocks. It is therefore not hard to show that the
communication-to-computation ratio for this algorithm goes to zero as $n/p_1$ increases.
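The staggered data flow can be checked with a serial simulation in which a dictionary keyed by (i, j) holds the local blocks of each processor. The Python sketch below is only an illustration: all function names are hypothetical, block indices are 0-based, matrices are lists of rows, and the unstagger phase is omitted because only the C blocks are returned.

```python
def matmul_add(C, A, B):
    """Return C + A*B for small square blocks stored as lists of rows."""
    r = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(r)) for j in range(r)]
            for i in range(r)]


def block(M, i, j, r):
    """Extract the r-by-r block (i, j) of M using 0-based block indices."""
    return [row[j * r:(j + 1) * r] for row in M[i * r:(i + 1) * r]]


def torus_multiply(A, B, C, p1):
    """Simulate D = C + A B on a p1-by-p1 torus with n = p1*r."""
    n = len(A)
    assert n % p1 == 0, "the sketch assumes n = p1*r"
    r = n // p1
    idx = [(i, j) for i in range(p1) for j in range(p1)]
    A_loc = {(i, j): block(A, i, j, r) for i, j in idx}
    B_loc = {(i, j): block(B, i, j, r) for i, j in idx}
    C_loc = {(i, j): block(C, i, j, r) for i, j in idx}

    def shift_west(X, rows):
        # send west, receive from east, but only in the listed processor rows
        return {(i, j): X[(i, (j + 1) % p1)] if i in rows else X[(i, j)]
                for i, j in idx}

    def shift_north(X, cols):
        # send north, receive from south, but only in the listed processor columns
        return {(i, j): X[((i + 1) % p1, j)] if j in cols else X[(i, j)]
                for i, j in idx}

    # Stagger: 0-based row i moves A west i times, column j moves B north j times.
    for k in range(1, p1):
        A_loc = shift_west(A_loc, {i for i in range(p1) if i >= k})
        B_loc = shift_north(B_loc, {j for j in range(p1) if j >= k})
    everyone = set(range(p1))
    for _ in range(p1):
        C_loc = {ij: matmul_add(C_loc[ij], A_loc[ij], B_loc[ij]) for ij in idx}
        A_loc = shift_west(A_loc, everyone)
        B_loc = shift_north(B_loc, everyone)

    # Assemble the distributed result into one n-by-n matrix for inspection.
    D = [[0.0] * n for _ in range(n)]
    for i, j in idx:
        for a in range(r):
            for b in range(r):
                D[i * r + a][j * r + b] = C_loc[(i, j)][a][b]
    return D
```

After the stagger, the processor at 0-based position (i, j) holds the blocks of A and B whose shared inner index is (i + j) mod p1, which is why every processor can multiply at every step.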


Problems

P6.2.1 Develop a ring implementation for Algorithm 6.2.1.

P6.2.2 An upper triangular matrix can be overwritten with its square without any
additional workspace. Write a dynamically scheduled, shared-memory procedure for
doing this.

6.3 Factorizations

In this section we present a pair of parallel Cholesky factorizations. To
illustrate what a distributed memory factorization looks like, we implement
the gaxpy Cholesky algorithm on a ring. A shared memory implementation
of outer product Cholesky is also detailed.


6.3.1 A Ring Cholesky 


Let us see how the Cholesky factorization procedure can be distributed on
a ring of $p$ processors. The starting point is the equation
$$G(\mu,\mu)\,G(\mu:n,\mu) = A(\mu:n,\mu) - \sum_{j=1}^{\mu-1} G(\mu,j)\,G(\mu:n,j) \equiv v(\mu:n).$$

This equation is obtained by equating the $\mu$th column in the n-by-n
equation $A = GG^T$. Once the vector $v(\mu:n)$ is found then $G(\mu:n,\mu)$ is a
simple scaling:
$$G(\mu:n,\mu) = v(\mu:n)/\sqrt{v(\mu)}.$$

For clarity, we first assume that $n = p$ and that Proc($\mu$) initially houses
$A(\mu:n,\mu)$. Upon completion, each processor overwrites its A-column with
the corresponding G-column. For Proc($\mu$) this process involves $\mu - 1$ saxpy
updates of the form
$$A(\mu:n,\mu) \leftarrow A(\mu:n,\mu) - G(\mu,j)\,G(\mu:n,j)$$
followed by a square root and a scaling. The general structure of Proc($\mu$)'s
node program is therefore as follows:


    for j = 1:μ - 1
        Receive a G-column from the left neighbor.
        If necessary, send a copy of the received G-column to
        the right neighbor.
        Update A(μ:n, μ).
    end
    Generate G(μ:n, μ) and, if necessary, send it to the
    right neighbor.


Thus Proc(1) immediately computes $G(1:n,1) = A(1:n,1)/\sqrt{A(1,1)}$ and
sends it to Proc(2). As soon as Proc(2) receives this column it can generate
$G(2:n,2)$ and pass it to Proc(3), etc. With this pipelining arrangement we
can assert that once a processor computes its G-column, it can quit. It
also follows that each processor receives G-columns in ascending order, i.e.,
$G(1:n,1)$, $G(2:n,2)$, etc. Based on these observations we have

    j = 1
    while j < μ
        recv(g_loc(j:n), left)
        if μ < n
            send(g_loc(j:n), right)
        end
        A_loc(μ:n) = A_loc(μ:n) - g_loc(μ) g_loc(μ:n)
        j = j + 1
    end
    A_loc(μ:n) = A_loc(μ:n)/sqrt(A_loc(μ))
    if μ < n
        send(A_loc(μ:n), right)
    end

Note that the number of received G-columns is given by $j - 1$. If $j = \mu$,
then it is time for Proc($\mu$) to generate and send $G(\mu:n,\mu)$.

We now extend this strategy to the general $n$ case. There are two obvious
ways to distribute the computation. We could require each processor
to compute a contiguous set of G-columns. For example, if $n = 11$, $p = 3$,
and $A = [\,a_1, \ldots, a_{11}\,]$, then we could distribute $A$ as follows

    [ a1  a2  a3  a4  |  a5  a6  a7  a8  |  a9  a10  a11 ]
          Proc(1)            Proc(2)           Proc(3)

Each processor could then proceed to find the corresponding G-columns.
The trouble with this approach is that (for example) Proc(1) is idle after
the fourth column of $G$ is found even though much work remains.

Greater load balancing results if we distribute the computational tasks
using the wrap mapping, i.e.,

    [ a1  a4  a7  a10  |  a2  a5  a8  a11  |  a3  a6  a9 ]
          Proc(1)             Proc(2)          Proc(3)

In this scheme Proc($\mu$) carries out the construction of $G(:, \mu:p:n)$. When
a given processor finishes computing its G-columns, each of the other
processors has at most one more G-column to find. Thus if $n/p \gg 1$, then all
of the processors are busy most of the time.
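The two column assignments can be generated mechanically. The short Python sketch below uses hypothetical function names and 1-based column indices; the contiguous version assumes blocks of width ceil(n/p), which matches the n = 11, p = 3 example.

```python
def contiguous_columns(n, p):
    """Assign blocks of ceil(n/p) consecutive columns to Proc(1), ..., Proc(p)."""
    width = -(-n // p)                       # ceiling division
    return [list(range(1 + mu * width, min(n, (mu + 1) * width) + 1))
            for mu in range(p)]


def wrap_columns(n, p):
    """Wrap mapping: Proc(mu) owns columns mu, mu + p, mu + 2p, ..."""
    return [list(range(mu, n + 1, p)) for mu in range(1, p + 1)]


if __name__ == "__main__":
    print(contiguous_columns(11, 3))
    print(wrap_columns(11, 3))
```

Under the wrap mapping the last columns owned by different processors are within p - 1 of each other, which is the reason no processor finishes long before the others.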

Let us examine the details of a wrap-distributed Cholesky procedure.
Each processor maintains a pair of counters. The counter $j$ is the index
of the next G-column to be received by Proc($\mu$). A processor also
needs to know the index of the next G-column that it is to produce. Note
that if $col = \mu:p:n$, then Proc($\mu$) is responsible for $G(:, col)$ and that
$L = \mathrm{length}(col)$ is the number of the G-columns that it must compute.
We use $q$ to indicate the status of G-column production. At any instant,
$col(q)$ is the index of the next G-column to be produced.


Algorithm 6.3.1 Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric and positive
definite and that $A = GG^T$ is its Cholesky factorization. If each node in
a $p$-processor ring executes the following program, then upon completion
Proc($\mu$) houses $G(k:n, k)$ for $k = \mu:p:n$ in a local array $A_{loc}(1:n, 1:L)$ where
$L = \mathrm{length}(col)$ and $col = \mu:p:n$. In particular, $G(col(q):n, col(q))$ is
housed in $A_{loc}(col(q):n, q)$ for $q = 1:L$. Assume the following local memory
initializations: $p$, $\mu$ (the node id), left and right (the neighbor id's), $n$, and
$A_{loc} = A(:, \mu:p:n)$.

    j = 1; q = 1; col = μ:p:n; L = length(col)
    while q ≤ L
        if j = col(q)
            {Form G(j:n, j).}
            A_loc(j:n, q) = A_loc(j:n, q)/sqrt(A_loc(j, q))
            if j < n
                send(A_loc(j:n, q), right)
            end
            j = j + 1
            {Update local columns.}
            for k = q + 1:L
                r = col(k)
                A_loc(r:n, k) = A_loc(r:n, k) - A_loc(r, q) A_loc(r:n, q)
            end
            q = q + 1
        else
            recv(g_loc(j:n), left)
            Compute α, the id of the processor that generated the
            received G-column.
            Compute β, the index of Proc(right)'s final column.
            if right ≠ α ∧ j < β
                send(g_loc(j:n), right)
            end
            {Update local columns.}
            for k = q:L
                r = col(k)
                A_loc(r:n, k) = A_loc(r:n, k) - g_loc(r) g_loc(r:n)
            end
            j = j + 1
        end
    end


To illustrate the logic of the pointer system we consider a sample 3-processor
situation with $n = 10$. Assume that the three local values of $q$ are 3, 2, and
2 and that the corresponding values of $col(q)$ are 7, 5, and 6:

    [ a1  a4  a7  a10  |  a2  a5  a8  |  a3  a6  a9 ]
          Proc(1)           Proc(2)       Proc(3)

Proc(2) now generates the fifth G-column and increments its $q$ to 3.
The decision to pass a received G-column to the right neighbor needs
to be explained. Two conditions must be fulfilled:


* The right neighbor must not be the processor which generated the G 
column. This way the circulation of the received G-column is properly 
terminated. 


* The right neighbor must still have more G-columns to generate. Oth- 
erwise, a G-column will be sent to an inactive processor. 


This kind of reasoning is quite typical in distributed memory matrix com- 
putations. 

Let us examine the behavior of Algorithm 6.3.1 under the assumption
that $n \gg p$. It is not hard to show that Proc($\mu$) performs
$$F(\mu) = \sum_{k=1}^{L} 2\bigl(n - (\mu + (k-1)p)\bigr)\bigl(\mu + (k-1)p\bigr) \approx \frac{n^3}{3p}$$
flops. Each processor receives and sends just about every G-column. Using
our communication overhead model (6.1.3), we see that the time each
processor spends communicating is given by
$$\tau_{comm} \approx \sum_{j=1}^{n} 2\bigl(\alpha_d + \beta_d (n - j)\bigr) \approx 2\alpha_d n + \beta_d n^2.$$
If we assume that computation proceeds at $R$ flops per second, then the
computation/communication ratio for Algorithm 6.3.1 is approximately
given by $(n/p)(1/(3R\beta_d))$. Thus, communication overheads diminish in
importance as $n/p$ grows.
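Both estimates are simple enough to evaluate directly. The Python sketch below is a hedged illustration with hypothetical function names: it charges 2(n - j)j flops to the processor that owns column j under the wrap mapping, and models the communication of each G-column as one receive and one send of n - j numbers with start-up cost alpha_d and reciprocal rate beta_d.

```python
def cholesky_flops(n, p, mu):
    """Flop count charged to Proc(mu), which owns columns mu, mu + p, ... up to n."""
    return sum(2 * (n - j) * j for j in range(mu, n + 1, p))


def comm_time(n, alpha_d, beta_d):
    """Modelled time to receive and forward every G-column once."""
    return sum(2 * (alpha_d + beta_d * (n - j)) for j in range(1, n + 1))


def compute_to_comm_ratio(n, p, rate, alpha_d, beta_d):
    """Computation time over communication time for one processor."""
    return (cholesky_flops(n, p, 1) / rate) / comm_time(n, alpha_d, beta_d)


if __name__ == "__main__":
    n, p = 1200, 4
    for mu in range(1, p + 1):
        print(mu, cholesky_flops(n, p, mu), n ** 3 / (3 * p))
```

For n much larger than p the per-processor counts are nearly identical and close to n^3/(3p), while the modelled communication time equals 2 alpha_d n + beta_d n(n - 1).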


6.3.2 A Shared Memory Cholesky 


Next we consider a shared memory implementation of the outer product
Cholesky algorithm:

    for k = 1:n
        A(k:n, k) = A(k:n, k)/sqrt(A(k, k))
        for j = k + 1:n
            A(j:n, j) = A(j:n, j) - A(j:n, k) A(j, k)
        end
    end

The j-loop oversees an outer product update. The $n - k$ saxpy operations
that make up its body are independent and easily parallelized. The scaling
of $A(k:n, k)$ can be carried out by a single processor with no threat to load
balancing.


Algorithm 6.3.2 Suppose $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite
matrix stored in a shared memory accessible to $p$ processors. If each
processor executes the following algorithm, then upon completion the lower
triangular part of $A$ is overwritten with its Cholesky factor. Assume the
following initializations in each local memory: $n$, $p$ and $\mu$ (the node id).

    for k = 1:n
        if μ = 1
            v_loc(k:n) = A(k:n, k)
            v_loc(k:n) = v_loc(k:n)/sqrt(v_loc(k))
            A(k:n, k) = v_loc(k:n)
        end
        barrier
        v_loc(k + 1:n) = A(k + 1:n, k)
        for j = (k + μ):p:n
            w_loc(j:n) = A(j:n, j)
            w_loc(j:n) = w_loc(j:n) - v_loc(j) v_loc(j:n)
            A(j:n, j) = w_loc(j:n)
        end
        barrier
    end

The scaling before the j-loop represents very little work compared to the
outer product update and so it is reasonable to assign that portion of the
computation to a single processor. Notice that two barrier statements are
required. The first ensures that a processor does not begin working on the
kth outer product update until the kth column of $G$ is made available by
Proc(1). The second barrier prevents the processing of step $k + 1$ from
beginning until the kth step is completely finished.
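The role of the two barriers can be seen in a threaded Python sketch of the outer product Cholesky. This is an illustration under stated assumptions, not a definitive implementation: `shared_memory_cholesky` is a hypothetical name, a list of rows stands in for the shared matrix, indices are 0-based, and the input is assumed to be symmetric positive definite (otherwise the square root fails in one thread and the others wait at the barrier).

```python
import math
import threading


def shared_memory_cholesky(A, p):
    """Overwrite the lower triangle of A with its Cholesky factor using p threads."""
    n = len(A)
    barrier = threading.Barrier(p)

    def node(mu):                                # mu = 1..p is the node id
        for k in range(n):                       # 0-based column index
            if mu == 1:
                d = math.sqrt(A[k][k])
                for i in range(k, n):
                    A[i][k] /= d                 # scale column k to obtain G(k:n, k)
            barrier.wait()                       # column k of G is now available
            v_loc = [A[i][k] for i in range(n)]  # local copy; entries below k are used
            for j in range(k + mu, n, p):        # wrap-mapped share of trailing columns
                for i in range(j, n):
                    A[i][j] -= v_loc[j] * v_loc[i]
            barrier.wait()                       # step k is finished on every thread

    threads = [threading.Thread(target=node, args=(mu,)) for mu in range(1, p + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A
```

Each thread updates a disjoint set of columns between the barriers, so no lock is needed inside the update loop; the strictly upper triangular part of A is left untouched.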


Problems

P6.3.1 It is possible to formulate a block version of Algorithm 6.3.1. Suppose $n = rN$.
For $k = 1:N$ we (a) have Proc(1) generate $G(:, 1 + (k - 1)r:kr)$ and (b) have all processors
participate in the rank $r$ update of the trailing submatrix $A(kr + 1:n, kr + 1:n)$. See §4.2.6.
The coarser granularity may improve performance if the individual processors like level-3
operations.

P6.3.2 Develop a shared memory QR factorization patterned after Algorithm 6.3.2.
Proc(1) should generate the Householder vectors and all processors should share in the
ensuing Householder update.

Chapter 7

The Unsymmetric Eigenvalue Problem

§7.1 Properties and Decompositions
§7.2 Perturbation Theory
§7.3 Power Iterations
§7.4 The Hessenberg and Real Schur Forms
§7.5 The Practical QR Algorithm
§7.6 Invariant Subspace Computations
§7.7 The QZ Method for $Ax = \lambda Bx$

Having discussed linear equations and least squares, we now direct our
attention to the third major problem area in matrix computations, the
algebraic eigenvalue problem. The unsymmetric problem is considered in
this chapter and the more agreeable symmetric case in the next.

Our first task is to present the decompositions of Schur and Jordan
along with the basic properties of eigenvalues and invariant subspaces. The
contrasting behavior of these two decompositions sets the stage for §7.2
in which we investigate how the eigenvalues and invariant subspaces of
a matrix are affected by perturbation. Condition numbers are developed 
that permit estimation of the errors that can be expected to arise because 
of roundoff. 

The key algorithm of the chapter is the justly famous QR algorithm. 
This procedure is the most complex algorithm presented in this book and its 
development is spread over three sections. We derive the basic QR iteration 
in §7.3 as a natural generalization of the simple power method. The next 


